Pandas merge: how to avoid duplicating columns

Question

I am trying to merge two DataFrames. Each has a two-level index (date, cusip), and some columns appear in both, for example currency and adj_date.

What is the best way to merge them on the index without ending up with two copies of currency and adj_date?

Each DataFrame has 90 columns, so I would like to avoid writing every column name out by hand.

df:                 currency  adj_date   data_col1 ...
date        cusip
2012-01-01  XSDP      USD      2012-01-03   0.45
...

df2:                currency  adj_date   data_col2 ...
date        cusip
2012-01-01  XSDP      USD      2012-01-03   0.45
...

If I do:

dfNew = pd.merge(df, df2, left_index=True, right_index=True, how='outer')

I get

dfNew:              currency_x  adj_date_x   data_col2 ... currency_y adj_date_y
date        cusip
2012-01-01  XSDP      USD      2012-01-03   0.45             USD         2012-01-03

Thank you! …

Answer 1

You can work out which columns appear only in df2 and use that to select a subset of its columns for the merge.

cols_to_use = df2.columns.difference(df.columns)

Then perform the merge (note that cols_to_use is an Index object, which can select columns directly; it also has a handy tolist() method if you need a plain list).

dfNew = pd.merge(df, df2[cols_to_use], left_index=True, right_index=True, how='outer')

Because the shared columns are taken from df only, no column names clash in the merge.
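
As a minimal sketch of the approach above, here is a self-contained run on made-up data shaped like the frames in the question. The values and the second index row are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical sample data mirroring the layout in the question.
index = pd.MultiIndex.from_tuples(
    [("2012-01-01", "XSDP"), ("2012-01-02", "XSDP")],
    names=["date", "cusip"],
)
df = pd.DataFrame(
    {"currency": ["USD", "USD"],
     "adj_date": ["2012-01-03", "2012-01-04"],
     "data_col1": [0.45, 0.50]},
    index=index,
)
df2 = pd.DataFrame(
    {"currency": ["USD", "USD"],
     "adj_date": ["2012-01-03", "2012-01-04"],
     "data_col2": [1.10, 1.20]},
    index=index,
)

# Columns that exist only in df2; the shared ones come from df alone.
cols_to_use = df2.columns.difference(df.columns)

df_new = pd.merge(df, df2[cols_to_use],
                  left_index=True, right_index=True, how="outer")
print(df_new.columns.tolist())
# ['currency', 'adj_date', 'data_col1', 'data_col2']
```

Note that Index.difference returns its result sorted where possible, so the columns taken from df2 may not keep their original order.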

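Answer 2

I use the suffixes option in .merge(), leaving the left-hand names unchanged and tagging the right-hand duplicates with _y so they can be dropped afterwards:

dfNew = df.merge(df2, left_index=True, right_index=True,
                 how='outer', suffixes=('', '_y'))
dfNew.drop(dfNew.filter(regex='_y$').columns.tolist(), axis=1, inplace=True)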

Thanks @ijoseph
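
One caveat with dropping by pattern: a column whose real name already ends in _y would be removed too. A possible guard, sketched here on hypothetical frames (the names left, right and energy_y are made up), is to build the drop list from the columns the two inputs actually share.

```python
import pandas as pd

# Hypothetical frames; 'energy_y' is a genuine column name in the left frame.
left = pd.DataFrame({"currency": ["USD"], "energy_y": [3.0]}, index=["a"])
right = pd.DataFrame({"currency": ["USD"], "data_col2": [0.45]}, index=["a"])

merged = left.merge(right, left_index=True, right_index=True,
                    how="outer", suffixes=("", "_y"))

# Only names present in both inputs receive the '_y' suffix, so derive the
# drop list from that overlap instead of matching on the suffix.
overlap = left.columns.intersection(right.columns)
merged = merged.drop(columns=[f"{col}_y" for col in overlap])
print(merged.columns.tolist())
# ['currency', 'energy_y', 'data_col2']
```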

Answer 3

Building on @rprog's answer, you can combine the suffix and filter steps into one line using a negative lookahead regex:

dfNew = df.merge(df2, left_index=True, right_index=True,
                 how='outer', suffixes=('', '_DROP')).filter(regex='^(?!.*_DROP)')

Or using df.join (rsuffix tags the right-hand duplicates, so the left-hand copies are the ones kept, as in the merge version):

dfNew = df.join(df2, how='outer', rsuffix='_DROP').filter(regex='^(?!.*_DROP)')

The regex keeps every column whose name does not contain "_DROP" (the lookahead is not anchored to the end of the name), so make sure to use a suffix that does not already appear in any column name.
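
To see which names the negative lookahead keeps, here is a small sketch on made-up column names (all three names are assumptions). It also shows an anchored variant, `_DROP$`, that rejects only names ending in the suffix.

```python
import pandas as pd

# Hypothetical column names as they might look after a merge with
# suffixes=('', '_DROP'); 'my_DROP_zone' merely contains the marker.
frame = pd.DataFrame(
    [[1, 2, 3]],
    columns=["currency", "currency_DROP", "my_DROP_zone"],
)

# Unanchored lookahead: rejects any name containing '_DROP'.
kept_loose = frame.filter(regex=r"^(?!.*_DROP)").columns.tolist()

# Anchored lookahead: rejects only names that end with '_DROP'.
kept_strict = frame.filter(regex=r"^(?!.*_DROP$)").columns.tolist()

print(kept_loose)   # ['currency']
print(kept_strict)  # ['currency', 'my_DROP_zone']
```

The anchored form is the safer choice when an existing column name might contain the marker somewhere in the middle.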

Answer 4

I'm new to Pandas, but I wanted to achieve the same thing: automatically avoid column names ending in _x or _y and remove the duplicated data. I finally did it by using this answer and this one from Stack Overflow.

sales.csv

    city;state;units
    Mendocino;CA;1
    Denver;CO;4
    Austin;TX;2

revenue.csv

    branch_id;city;revenue;state_id
    10;Austin;100;TX
    20;Austin;83;TX
    30;Austin;4;TX
    47;Austin;200;TX
    20;Denver;83;CO
    30;Springfield;4;I

merge.py

import pandas

def drop_y(df):
    # list comprehension of the cols that end with '_y'
    to_drop = [x for x in df if x.endswith('_y')]
    df.drop(to_drop, axis=1, inplace=True)


sales = pandas.read_csv('data/sales.csv', delimiter=';')
revenue = pandas.read_csv('data/revenue.csv', delimiter=';')

result = pandas.merge(sales, revenue,  how='inner', left_on=['state'], right_on=['state_id'], suffixes=('', '_y'))
drop_y(result)
result.to_csv('results/output.csv', index=True, index_label='id', sep=';')

When executing the merge command I replace the _x suffix with an empty string, and then I can remove the columns ending with _y.

output.csv

    id;city;state;units;branch_id;revenue;state_id
    0;Denver;CO;4;20;83;CO
    1;Austin;TX;2;10;100;TX
    2;Austin;TX;2;20;83;TX
    3;Austin;TX;2;30;4;TX
    4;Austin;TX;2;47;200;TX

Answer 5

This works around the problem rather than avoiding it, but I have written a function that deals with the extra columns after the merge:

import pandas as pd

def merge_fix_cols(df_company, df_product, uniqueID):
    df_merged = pd.merge(df_company, df_product, how='left', on=uniqueID)
    # Drop the right-hand duplicates (suffix '_y').
    to_drop = [col for col in df_merged if col.endswith('_y')]
    df_merged = df_merged.drop(columns=to_drop)
    # Remove the '_x' suffix from the left-hand copies. Slicing is used
    # because rstrip('_x') strips any trailing '_' or 'x' characters.
    return df_merged.rename(
        columns=lambda col: col[:-2] if col.endswith('_x') else col)

Seems to work well with my merges!
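
A note on stripping the _x suffix: str.rstrip treats its argument as a set of characters rather than a literal suffix, so it can eat into legitimate names. The short sketch below uses made-up column names to show the difference from slicing off the suffix.

```python
# Hypothetical column names, two of them carrying the '_x' merge suffix.
names = ["tax_x", "index", "price_x"]

# rstrip('_x') removes every trailing '_' or 'x', not just the suffix.
stripped = [name.rstrip("_x") for name in names]

# Slicing removes exactly the two suffix characters, and only when present.
sliced = [name[:-2] if name.endswith("_x") else name for name in names]

print(stripped)  # ['ta', 'inde', 'price']
print(sliced)    # ['tax', 'index', 'price']
```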
