Question: pandas get rows which are NOT in other dataframe
I have two pandas DataFrames which have some rows in common.
Suppose dataframe2 is a subset of dataframe1.
How can I get the rows of dataframe1 which are not in dataframe2?
import pandas
df1 = pandas.DataFrame(data = {'col1' : [1, 2, 3, 4, 5], 'col2' : [10, 11, 12, 13, 14]})
df2 = pandas.DataFrame(data = {'col1' : [1, 2, 3], 'col2' : [10, 11, 12]})
Answer 0
One method would be to store the result of an inner merge of both dfs; we can then select the rows of df1 where neither column's value appears in this common set:
In [119]:
common = df1.merge(df2,on=['col1','col2'])
print(common)
df1[(~df1.col1.isin(common.col1))&(~df1.col2.isin(common.col2))]
col1 col2
0 1 10
1 2 11
2 3 12
Out[119]:
col1 col2
3 4 13
4 5 14
EDIT
Another method, as you've found, is to use isin, which will produce NaN rows that you can drop:
In [138]:
df1[~df1.isin(df2)].dropna()
Out[138]:
col1 col2
3 4 13
4 5 14
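However, isin with a DataFrame argument compares values at matching index and column labels, so if the common rows do not sit at the same index positions in df2 as in df1, this won't work: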
df2 = pd.DataFrame(data = {'col1' : [2, 3,4], 'col2' : [11, 12,13]})
will produce the entire df:
In [140]:
df1[~df1.isin(df2)].dropna()
Out[140]:
col1 col2
0 1 10
1 2 11
2 3 12
3 4 13
4 5 14
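The behaviour above follows from label alignment. As a minimal sketch (the variable names mask, df2_aligned and matched are illustrative, not from the original answer), the following shows that df1.isin(df2) finds no matches when the shared rows sit under different index labels, and that re-labelling df2 to line up with df1 restores the expected result:

```python
import pandas as pd

df1 = pd.DataFrame({'col1': [1, 2, 3, 4, 5], 'col2': [10, 11, 12, 13, 14]})
# Same contents as df1's rows labelled 1..3, but stored under labels 0..2.
df2 = pd.DataFrame({'col1': [2, 3, 4], 'col2': [11, 12, 13]})

# DataFrame.isin(DataFrame) compares cells at matching index/column labels,
# so no cell matches here even though three rows have identical contents.
mask = df1.isin(df2)
print(mask.any().any())

# Illustrative only: give df2 the labels its rows have in df1.
df2_aligned = df2.copy()
df2_aligned.index = [1, 2, 3]
matched = df1.isin(df2_aligned).all(axis=1)
print(df1[~matched])
```

In practice the labels rarely line up like this, which is why the merge-based answers below compare row contents instead.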
Answer 1
The currently selected solution produces incorrect results. To correctly solve this problem, we can perform a left-join from df1 to df2, making sure to first get just the unique rows of df2.
First, we need to modify the original DataFrame to add the row with data [3, 10].
df1 = pd.DataFrame(data = {'col1' : [1, 2, 3, 4, 5, 3],
'col2' : [10, 11, 12, 13, 14, 10]})
df2 = pd.DataFrame(data = {'col1' : [1, 2, 3],
'col2' : [10, 11, 12]})
df1
col1 col2
0 1 10
1 2 11
2 3 12
3 4 13
4 5 14
5 3 10
df2
col1 col2
0 1 10
1 2 11
2 3 12
Perform a left-join, eliminating duplicates in df2 so that each row of df1 joins with exactly 1 row of df2. Use the parameter indicator to return an extra column indicating which table the row was from.
df_all = df1.merge(df2.drop_duplicates(), on=['col1','col2'],
how='left', indicator=True)
df_all
col1 col2 _merge
0 1 10 both
1 2 11 both
2 3 12 both
3 4 13 left_only
4 5 14 left_only
5 3 10 left_only
Create a boolean condition:
df_all['_merge'] == 'left_only'
0 False
1 False
2 False
3 True
4 True
5 True
Name: _merge, dtype: bool
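The boolean condition above can then be used to pick out the rows. The following is a minimal sketch of the whole left-join approach in one place; the name only_in_df1 is illustrative. It relies on the left merge returning exactly one row per row of df1, in df1's order, because df2 has been de-duplicated first, so the mask can be applied to df1 by position:

```python
import pandas as pd

df1 = pd.DataFrame({'col1': [1, 2, 3, 4, 5, 3],
                    'col2': [10, 11, 12, 13, 14, 10]})
df2 = pd.DataFrame({'col1': [1, 2, 3],
                    'col2': [10, 11, 12]})

# Left join with an indicator column telling where each row was found.
df_all = df1.merge(df2.drop_duplicates(), on=['col1', 'col2'],
                   how='left', indicator=True)

# Convert the mask to a plain array so it is applied by position;
# this keeps df1's original index labels in the result.
left_only = (df_all['_merge'] == 'left_only').to_numpy()
only_in_df1 = df1[left_only]
print(only_in_df1)
```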
Why other solutions are wrong
A few solutions make the same mistake: they only check that each value is independently in each column, not together in the same row. Adding the last row, which is unique but has values from both columns of df2, exposes the mistake:
common = df1.merge(df2,on=['col1','col2'])
(~df1.col1.isin(common.col1))&(~df1.col2.isin(common.col2))
0 False
1 False
2 False
3 True
4 True
5 False
dtype: bool
This solution gets the same wrong result:
df1.isin(df2.to_dict('list')).all(axis=1)
Answer 2
Assuming that the indexes are consistent in the dataframes (not taking into account the actual col values):
df1[~df1.index.isin(df2.index)]
Answer 3
As already hinted at, isin requires columns and indices to be the same for a match. If the match should only be on row contents, one way to get the mask for filtering the rows present is to convert the rows to a (Multi)Index:
In [77]: df1 = pandas.DataFrame(data = {'col1' : [1, 2, 3, 4, 5, 3], 'col2' : [10, 11, 12, 13, 14, 10]})
In [78]: df2 = pandas.DataFrame(data = {'col1' : [1, 3, 4], 'col2' : [10, 12, 13]})
In [79]: df1.loc[~df1.set_index(list(df1.columns)).index.isin(df2.set_index(list(df2.columns)).index)]
Out[79]:
col1 col2
1 2 11
4 5 14
5 3 10
If the index should be taken into account, set_index has a keyword argument append that appends columns to the existing index. If the columns do not line up, list(df.columns) can be replaced with column specifications to align the data.
pandas.MultiIndex.from_tuples(df<N>.to_records(index = False).tolist())
could alternatively be used to create the indices, though I doubt this is more efficient.
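As a further sketch of the same row-to-MultiIndex idea, the helper below builds the keys with pandas.MultiIndex.from_frame instead of set_index. The function name rows_not_in and its columns parameter are hypothetical, introduced only for illustration; it assumes both frames contain the listed columns, and it is only exercised here with two key columns:

```python
import pandas as pd

def rows_not_in(left, right, columns=None):
    """Return rows of `left` whose values in `columns` are not a row of `right`.

    Hypothetical helper: `columns` defaults to all columns of `left`.
    """
    columns = list(left.columns) if columns is None else list(columns)
    # Each row becomes one tuple-like key in a MultiIndex.
    left_keys = pd.MultiIndex.from_frame(left[columns])
    right_keys = pd.MultiIndex.from_frame(right[columns])
    # isin on the keys compares whole rows, ignoring the original index.
    return left[~left_keys.isin(right_keys)]

df1 = pd.DataFrame({'col1': [1, 2, 3, 4, 5, 3],
                    'col2': [10, 11, 12, 13, 14, 10]})
df2 = pd.DataFrame({'col1': [1, 3, 4], 'col2': [10, 12, 13]})
result = rows_not_in(df1, df2)
print(result)
```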
Answer 4
Suppose you have two dataframes, df_1 and df_2, with multiple fields (column names), and you want to find only those entries in df_1 that are not in df_2 on the basis of some fields (e.g. field_x, field_y). Follow these steps:
Step 1. Add a column key1 to df_1 and a column key2 to df_2.
Step 2. Merge the dataframes as shown below. field_x and field_y are our desired columns.
Step 3. Select only those rows from the merged df_1 where key1 is not equal to key2.
Step 4. Drop key1 and key2.
This method will solve your problem and works fast even with big data sets. I have tried it for dataframes with more than 1,000,000 rows.
df_1['key1'] = 1
df_2['key2'] = 1
df_1 = pd.merge(df_1, df_2, on=['field_x', 'field_y'], how = 'left')
df_1 = df_1[~(df_1.key2 == df_1.key1)]
df_1 = df_1.drop(['key1','key2'], axis=1)
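The four steps above can also be wrapped in a function that leaves both inputs unmodified. This is a sketch under stated assumptions rather than the answer's own code: the function name anti_join, the marker column _in_right, and the sample data are all made up for illustration, and df_2 is reduced to its unique key rows first so the left merge returns exactly one row per row of df_1.

```python
import pandas as pd

def anti_join(df_1, df_2, on):
    """Rows of df_1 whose values in the `on` columns do not occur in df_2."""
    # Keep only the key columns of df_2, de-duplicated, plus a marker column.
    right = df_2[on].drop_duplicates().assign(_in_right=True)
    merged = df_1.merge(right, on=on, how='left')
    # The marker is missing exactly where df_1's key was not found in df_2.
    keep = merged['_in_right'].isna().to_numpy()
    # Apply the mask by position so df_1's index and columns are preserved.
    return df_1[keep]

# Made-up sample data for illustration.
df_1 = pd.DataFrame({'field_x': [1, 2, 3],
                     'field_y': ['a', 'b', 'c'],
                     'value': [0.1, 0.2, 0.3]})
df_2 = pd.DataFrame({'field_x': [2, 2],
                     'field_y': ['b', 'b'],
                     'other': [7, 8]})
result = anti_join(df_1, df_2, on=['field_x', 'field_y'])
print(result)
```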
Answer 6
You can do it using the isin(dict) method:
In [74]: df1[~df1.isin(df2.to_dict('list')).all(axis=1)]
Out[74]:
   col1  col2
3     4    13
4     5    14
Explanation:
In [75]: df2.to_dict('list')
Out[75]: {'col1': [1, 2, 3], 'col2': [10, 11, 12]}
In [76]: df1.isin(df2.to_dict('list'))
Out[76]:
    col1   col2
0   True   True
1   True   True
2   True   True
3  False  False
4  False  False
In [77]: df1.isin(df2.to_dict('list')).all(axis=1)
Out[77]:
0     True
1     True
2     True
3    False
4    False
dtype: bool
Answer 7
You can also concat df1 and df2:
x = pd.concat([df1, df2])
and then remove all duplicates:
y = x.drop_duplicates(keep=False, inplace=False)
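One caveat is worth illustrating: drop_duplicates(keep=False) on the concatenation yields the symmetric difference, so rows that exist only in df2 are returned as well. The sketch below uses a modified df2 with an extra row (9, 99), added here purely for illustration, and shows a common workaround of concatenating df2 twice so that its rows always count as duplicates. Either variant also drops rows that are duplicated within df1 itself.

```python
import pandas as pd

df1 = pd.DataFrame({'col1': [1, 2, 3, 4, 5], 'col2': [10, 11, 12, 13, 14]})
# Illustrative df2 that is NOT a subset of df1: (9, 99) appears only here.
df2 = pd.DataFrame({'col1': [1, 2, 3, 9], 'col2': [10, 11, 12, 99]})

# Symmetric difference: rows unique to df1 plus rows unique to df2.
symmetric = pd.concat([df1, df2]).drop_duplicates(keep=False)
print(symmetric)

# Including df2 twice guarantees every df2 row is duplicated and removed,
# leaving only rows of df1 that do not occur in df2.
one_sided = pd.concat([df1, df2, df2]).drop_duplicates(keep=False)
print(one_sided)
```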
Answer 8
How about this:
import numpy as np
import pandas
df1 = pandas.DataFrame(data = {'col1' : [1, 2, 3, 4, 5],
'col2' : [10, 11, 12, 13, 14]})
df2 = pandas.DataFrame(data = {'col1' : [1, 2, 3],
'col2' : [10, 11, 12]})
records_df2 = set([tuple(row) for row in df2.values])
in_df2_mask = np.array([tuple(row) in records_df2 for row in df1.values])
result = df1[~in_df2_mask]
Answer 9
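Here is another way of solving this. Because merge returns a fresh 0..n-1 index rather than df1's labels, the original labels are carried through with reset_index (this assumes df1 has an unnamed index and no column called index):
matched = df1.reset_index().merge(df2, how='inner', on=['col1', 'col2'])['index']
df1[~df1.index.isin(matched)]
Or:
df1.loc[df1.index.difference(matched)]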
Answer 10
My way of doing this involves adding a new column that is unique to one dataframe and using it to choose whether to keep an entry:
df2['Empt'] = 1
nonuni = pd.merge(df1, df2, on=['field_x', 'field_y'], how='outer')
nonuni['Empt'] = nonuni['Empt'].fillna(0)
This makes it so every entry from df1 has a code: 0 if it is unique to df1, 1 if it is in both dataframes. You then use this to restrict to what you want:
answer = nonuni[nonuni['Empt'] == 0]
Answer 11
Extract the dissimilar rows using the merge function:
df = df.merge(same.drop_duplicates(), on=['col1','col2'],
how='left', indicator=True)
Save the dissimilar rows to CSV:
df[df['_merge'] == 'left_only'].to_csv('output.csv')