熊猫-如何展平列中的层次结构索引

问题:熊猫-如何展平列中的层次结构索引

我有一个在轴1(列)中具有层次结构索引的数据框(来自groupby.agg操作):

     USAF   WBAN  year  month  day  s_PC  s_CL  s_CD  s_CNT  tempf       
                                     sum   sum   sum    sum   amax   amin
0  702730  26451  1993      1    1     1     0    12     13  30.92  24.98
1  702730  26451  1993      1    2     0     0    13     13  32.00  24.98
2  702730  26451  1993      1    3     1    10     2     13  23.00   6.98
3  702730  26451  1993      1    4     1     0    12     13  10.04   3.92
4  702730  26451  1993      1    5     3     0    10     13  19.94  10.94

我想将其展平,使其看起来像这样(名称不是关键的-我可以重命名):

     USAF   WBAN  year  month  day  s_PC  s_CL  s_CD  s_CNT  tempf_amax  tmpf_amin   
0  702730  26451  1993      1    1     1     0    12     13  30.92          24.98
1  702730  26451  1993      1    2     0     0    13     13  32.00          24.98
2  702730  26451  1993      1    3     1    10     2     13  23.00          6.98
3  702730  26451  1993      1    4     1     0    12     13  10.04          3.92
4  702730  26451  1993      1    5     3     0    10     13  19.94          10.94

我该怎么做呢?(我已经尝试了很多,无济于事。)

根据建议,这是字典形式的头

{('USAF', ''): {0: '702730',
  1: '702730',
  2: '702730',
  3: '702730',
  4: '702730'},
 ('WBAN', ''): {0: '26451', 1: '26451', 2: '26451', 3: '26451', 4: '26451'},
 ('day', ''): {0: 1, 1: 2, 2: 3, 3: 4, 4: 5},
 ('month', ''): {0: 1, 1: 1, 2: 1, 3: 1, 4: 1},
 ('s_CD', 'sum'): {0: 12.0, 1: 13.0, 2: 2.0, 3: 12.0, 4: 10.0},
 ('s_CL', 'sum'): {0: 0.0, 1: 0.0, 2: 10.0, 3: 0.0, 4: 0.0},
 ('s_CNT', 'sum'): {0: 13.0, 1: 13.0, 2: 13.0, 3: 13.0, 4: 13.0},
 ('s_PC', 'sum'): {0: 1.0, 1: 0.0, 2: 1.0, 3: 1.0, 4: 3.0},
 ('tempf', 'amax'): {0: 30.920000000000002,
  1: 32.0,
  2: 23.0,
  3: 10.039999999999999,
  4: 19.939999999999998},
 ('tempf', 'amin'): {0: 24.98,
  1: 24.98,
  2: 6.9799999999999969,
  3: 3.9199999999999982,
  4: 10.940000000000001},
 ('year', ''): {0: 1993, 1: 1993, 2: 1993, 3: 1993, 4: 1993}}

I have a data frame with a hierarchical index in axis 1 (columns) (from a groupby.agg operation):

     USAF   WBAN  year  month  day  s_PC  s_CL  s_CD  s_CNT  tempf       
                                     sum   sum   sum    sum   amax   amin
0  702730  26451  1993      1    1     1     0    12     13  30.92  24.98
1  702730  26451  1993      1    2     0     0    13     13  32.00  24.98
2  702730  26451  1993      1    3     1    10     2     13  23.00   6.98
3  702730  26451  1993      1    4     1     0    12     13  10.04   3.92
4  702730  26451  1993      1    5     3     0    10     13  19.94  10.94

I want to flatten it, so that it looks like this (names aren’t critical – I could rename):

     USAF   WBAN  year  month  day  s_PC  s_CL  s_CD  s_CNT  tempf_amax  tmpf_amin   
0  702730  26451  1993      1    1     1     0    12     13  30.92          24.98
1  702730  26451  1993      1    2     0     0    13     13  32.00          24.98
2  702730  26451  1993      1    3     1    10     2     13  23.00          6.98
3  702730  26451  1993      1    4     1     0    12     13  10.04          3.92
4  702730  26451  1993      1    5     3     0    10     13  19.94          10.94

How do I do this? (I’ve tried a lot, to no avail.)

Per a suggestion, here is the head in dict form

{('USAF', ''): {0: '702730',
  1: '702730',
  2: '702730',
  3: '702730',
  4: '702730'},
 ('WBAN', ''): {0: '26451', 1: '26451', 2: '26451', 3: '26451', 4: '26451'},
 ('day', ''): {0: 1, 1: 2, 2: 3, 3: 4, 4: 5},
 ('month', ''): {0: 1, 1: 1, 2: 1, 3: 1, 4: 1},
 ('s_CD', 'sum'): {0: 12.0, 1: 13.0, 2: 2.0, 3: 12.0, 4: 10.0},
 ('s_CL', 'sum'): {0: 0.0, 1: 0.0, 2: 10.0, 3: 0.0, 4: 0.0},
 ('s_CNT', 'sum'): {0: 13.0, 1: 13.0, 2: 13.0, 3: 13.0, 4: 13.0},
 ('s_PC', 'sum'): {0: 1.0, 1: 0.0, 2: 1.0, 3: 1.0, 4: 3.0},
 ('tempf', 'amax'): {0: 30.920000000000002,
  1: 32.0,
  2: 23.0,
  3: 10.039999999999999,
  4: 19.939999999999998},
 ('tempf', 'amin'): {0: 24.98,
  1: 24.98,
  2: 6.9799999999999969,
  3: 3.9199999999999982,
  4: 10.940000000000001},
 ('year', ''): {0: 1993, 1: 1993, 2: 1993, 3: 1993, 4: 1993}}

回答 0

我认为最简单的方法是将列设置为顶级:

df.columns = df.columns.get_level_values(0)

注意:如果to级别具有名称,您也可以通过此名称(而不是0)来访问它。

如果要将joinMultiIndex 组合成一个索引(假设您的列中仅包含字符串条目),则可以:

df.columns = [' '.join(col).strip() for col in df.columns.values]

注意:strip没有第二个索引时,必须使用空格。

In [11]: [' '.join(col).strip() for col in df.columns.values]
Out[11]: 
['USAF',
 'WBAN',
 'day',
 'month',
 's_CD sum',
 's_CL sum',
 's_CNT sum',
 's_PC sum',
 'tempf amax',
 'tempf amin',
 'year']

I think the easiest way to do this would be to set the columns to the top level:

df.columns = df.columns.get_level_values(0)

Note: if the to level has a name you can also access it by this, rather than 0.

.

If you want to combine/join your MultiIndex into one Index (assuming you have just string entries in your columns) you could:

df.columns = [' '.join(col).strip() for col in df.columns.values]

Note: we must strip the whitespace for when there is no second index.

In [11]: [' '.join(col).strip() for col in df.columns.values]
Out[11]: 
['USAF',
 'WBAN',
 'day',
 'month',
 's_CD sum',
 's_CL sum',
 's_CNT sum',
 's_PC sum',
 'tempf amax',
 'tempf amin',
 'year']

回答 1

pd.DataFrame(df.to_records()) # multiindex become columns and new index is integers only
pd.DataFrame(df.to_records()) # multiindex become columns and new index is integers only

回答 2

该线程上的所有当前答案都必须已过时。从pandas0.24.0版开始,.to_flat_index()您需要做什么。

从熊猫自己的文档中

MultiIndex.to_flat_index()

将MultiIndex转换为包含级别值的元组索引。

文档中的一个简单示例:

import pandas as pd
print(pd.__version__) # '0.23.4'
index = pd.MultiIndex.from_product(
        [['foo', 'bar'], ['baz', 'qux']],
        names=['a', 'b'])

print(index)
# MultiIndex(levels=[['bar', 'foo'], ['baz', 'qux']],
#           codes=[[1, 1, 0, 0], [0, 1, 0, 1]],
#           names=['a', 'b'])

申请to_flat_index()

index.to_flat_index()
# Index([('foo', 'baz'), ('foo', 'qux'), ('bar', 'baz'), ('bar', 'qux')], dtype='object')

用它来替换现有的pandas

一个如何在上使用它的示例dat,它是一个带有MultiIndex列的DataFrame :

dat = df.loc[:,['name','workshop_period','class_size']].groupby(['name','workshop_period']).describe()
print(dat.columns)
# MultiIndex(levels=[['class_size'], ['count', 'mean', 'std', 'min', '25%', '50%', '75%', 'max']],
#            codes=[[0, 0, 0, 0, 0, 0, 0, 0], [0, 1, 2, 3, 4, 5, 6, 7]])

dat.columns = dat.columns.to_flat_index()
print(dat.columns)
# Index([('class_size', 'count'),  ('class_size', 'mean'),
#     ('class_size', 'std'),   ('class_size', 'min'),
#     ('class_size', '25%'),   ('class_size', '50%'),
#     ('class_size', '75%'),   ('class_size', 'max')],
#  dtype='object')

All of the current answers on this thread must have been a bit dated. As of pandas version 0.24.0, the .to_flat_index() does what you need.

From panda’s own documentation:

MultiIndex.to_flat_index()

Convert a MultiIndex to an Index of Tuples containing the level values.

A simple example from its documentation:

import pandas as pd
print(pd.__version__) # '0.23.4'
index = pd.MultiIndex.from_product(
        [['foo', 'bar'], ['baz', 'qux']],
        names=['a', 'b'])

print(index)
# MultiIndex(levels=[['bar', 'foo'], ['baz', 'qux']],
#           codes=[[1, 1, 0, 0], [0, 1, 0, 1]],
#           names=['a', 'b'])

Applying to_flat_index():

index.to_flat_index()
# Index([('foo', 'baz'), ('foo', 'qux'), ('bar', 'baz'), ('bar', 'qux')], dtype='object')

Using it to replace existing pandas column

An example of how you’d use it on dat, which is a DataFrame with a MultiIndex column:

dat = df.loc[:,['name','workshop_period','class_size']].groupby(['name','workshop_period']).describe()
print(dat.columns)
# MultiIndex(levels=[['class_size'], ['count', 'mean', 'std', 'min', '25%', '50%', '75%', 'max']],
#            codes=[[0, 0, 0, 0, 0, 0, 0, 0], [0, 1, 2, 3, 4, 5, 6, 7]])

dat.columns = dat.columns.to_flat_index()
print(dat.columns)
# Index([('class_size', 'count'),  ('class_size', 'mean'),
#     ('class_size', 'std'),   ('class_size', 'min'),
#     ('class_size', '25%'),   ('class_size', '50%'),
#     ('class_size', '75%'),   ('class_size', 'max')],
#  dtype='object')

回答 3

安迪·海登(Andy Hayden)的答案当然是最简单的方法-如果要避免重复的列标签,则需要进行一些调整

In [34]: df
Out[34]: 
     USAF   WBAN  day  month  s_CD  s_CL  s_CNT  s_PC  tempf         year
                               sum   sum    sum   sum   amax   amin      
0  702730  26451    1      1    12     0     13     1  30.92  24.98  1993
1  702730  26451    2      1    13     0     13     0  32.00  24.98  1993
2  702730  26451    3      1     2    10     13     1  23.00   6.98  1993
3  702730  26451    4      1    12     0     13     1  10.04   3.92  1993
4  702730  26451    5      1    10     0     13     3  19.94  10.94  1993


In [35]: mi = df.columns

In [36]: mi
Out[36]: 
MultiIndex
[(USAF, ), (WBAN, ), (day, ), (month, ), (s_CD, sum), (s_CL, sum), (s_CNT, sum), (s_PC, sum), (tempf, amax), (tempf, amin), (year, )]


In [37]: mi.tolist()
Out[37]: 
[('USAF', ''),
 ('WBAN', ''),
 ('day', ''),
 ('month', ''),
 ('s_CD', 'sum'),
 ('s_CL', 'sum'),
 ('s_CNT', 'sum'),
 ('s_PC', 'sum'),
 ('tempf', 'amax'),
 ('tempf', 'amin'),
 ('year', '')]

In [38]: ind = pd.Index([e[0] + e[1] for e in mi.tolist()])

In [39]: ind
Out[39]: Index([USAF, WBAN, day, month, s_CDsum, s_CLsum, s_CNTsum, s_PCsum, tempfamax, tempfamin, year], dtype=object)

In [40]: df.columns = ind




In [46]: df
Out[46]: 
     USAF   WBAN  day  month  s_CDsum  s_CLsum  s_CNTsum  s_PCsum  tempfamax  tempfamin  \
0  702730  26451    1      1       12        0        13        1      30.92      24.98   
1  702730  26451    2      1       13        0        13        0      32.00      24.98   
2  702730  26451    3      1        2       10        13        1      23.00       6.98   
3  702730  26451    4      1       12        0        13        1      10.04       3.92   
4  702730  26451    5      1       10        0        13        3      19.94      10.94   




   year  
0  1993  
1  1993  
2  1993  
3  1993  
4  1993

Andy Hayden’s answer is certainly the easiest way — if you want to avoid duplicate column labels you need to tweak a bit

In [34]: df
Out[34]: 
     USAF   WBAN  day  month  s_CD  s_CL  s_CNT  s_PC  tempf         year
                               sum   sum    sum   sum   amax   amin      
0  702730  26451    1      1    12     0     13     1  30.92  24.98  1993
1  702730  26451    2      1    13     0     13     0  32.00  24.98  1993
2  702730  26451    3      1     2    10     13     1  23.00   6.98  1993
3  702730  26451    4      1    12     0     13     1  10.04   3.92  1993
4  702730  26451    5      1    10     0     13     3  19.94  10.94  1993


In [35]: mi = df.columns

In [36]: mi
Out[36]: 
MultiIndex
[(USAF, ), (WBAN, ), (day, ), (month, ), (s_CD, sum), (s_CL, sum), (s_CNT, sum), (s_PC, sum), (tempf, amax), (tempf, amin), (year, )]


In [37]: mi.tolist()
Out[37]: 
[('USAF', ''),
 ('WBAN', ''),
 ('day', ''),
 ('month', ''),
 ('s_CD', 'sum'),
 ('s_CL', 'sum'),
 ('s_CNT', 'sum'),
 ('s_PC', 'sum'),
 ('tempf', 'amax'),
 ('tempf', 'amin'),
 ('year', '')]

In [38]: ind = pd.Index([e[0] + e[1] for e in mi.tolist()])

In [39]: ind
Out[39]: Index([USAF, WBAN, day, month, s_CDsum, s_CLsum, s_CNTsum, s_PCsum, tempfamax, tempfamin, year], dtype=object)

In [40]: df.columns = ind




In [46]: df
Out[46]: 
     USAF   WBAN  day  month  s_CDsum  s_CLsum  s_CNTsum  s_PCsum  tempfamax  tempfamin  \
0  702730  26451    1      1       12        0        13        1      30.92      24.98   
1  702730  26451    2      1       13        0        13        0      32.00      24.98   
2  702730  26451    3      1        2       10        13        1      23.00       6.98   
3  702730  26451    4      1       12        0        13        1      10.04       3.92   
4  702730  26451    5      1       10        0        13        3      19.94      10.94   




   year  
0  1993  
1  1993  
2  1993  
3  1993  
4  1993

回答 4

df.columns = ['_'.join(tup).rstrip('_') for tup in df.columns.values]
df.columns = ['_'.join(tup).rstrip('_') for tup in df.columns.values]

回答 5

而且,如果您想保留第二级多索引中的任何聚合信息,则可以尝试以下操作:

In [1]: new_cols = [''.join(t) for t in df.columns]
Out[1]:
['USAF',
 'WBAN',
 'day',
 'month',
 's_CDsum',
 's_CLsum',
 's_CNTsum',
 's_PCsum',
 'tempfamax',
 'tempfamin',
 'year']

In [2]: df.columns = new_cols

And if you want to retain any of the aggregation info from the second level of the multiindex you can try this:

In [1]: new_cols = [''.join(t) for t in df.columns]
Out[1]:
['USAF',
 'WBAN',
 'day',
 'month',
 's_CDsum',
 's_CLsum',
 's_CNTsum',
 's_PCsum',
 'tempfamax',
 'tempfamin',
 'year']

In [2]: df.columns = new_cols

回答 6

使用map函数的最pythonic方法。

df.columns = df.columns.map(' '.join).str.strip()

输出print(df.columns)

Index(['USAF', 'WBAN', 'day', 'month', 's_CD sum', 's_CL sum', 's_CNT sum',
       's_PC sum', 'tempf amax', 'tempf amin', 'year'],
      dtype='object')

使用Python 3.6+和f字符串进行更新:

df.columns = [f'{f} {s}' if s != '' else f'{f}' 
              for f, s in df.columns]

print(df.columns)

输出:

Index(['USAF', 'WBAN', 'day', 'month', 's_CD sum', 's_CL sum', 's_CNT sum',
       's_PC sum', 'tempf amax', 'tempf amin', 'year'],
      dtype='object')

The most pythonic way to do this to use map function.

df.columns = df.columns.map(' '.join).str.strip()

Output print(df.columns):

Index(['USAF', 'WBAN', 'day', 'month', 's_CD sum', 's_CL sum', 's_CNT sum',
       's_PC sum', 'tempf amax', 'tempf amin', 'year'],
      dtype='object')

Update using Python 3.6+ with f string:

df.columns = [f'{f} {s}' if s != '' else f'{f}' 
              for f, s in df.columns]

print(df.columns)

Output:

Index(['USAF', 'WBAN', 'day', 'month', 's_CD sum', 's_CL sum', 's_CNT sum',
       's_PC sum', 'tempf amax', 'tempf amin', 'year'],
      dtype='object')

回答 7

对我来说,最简单,最直观的解决方案是使用get_level_values组合列名称。当您在同一列上执行多个聚合时,这可以防止重复的列名称:

level_one = df.columns.get_level_values(0).astype(str)
level_two = df.columns.get_level_values(1).astype(str)
df.columns = level_one + level_two

如果要在列之间使用分隔符,则可以执行此操作。这将返回与Seiji Armstrong关于已接受答案的评论相同的内容,该评论仅包括两个索引级别中的值的列的下划线:

level_one = df.columns.get_level_values(0).astype(str)
level_two = df.columns.get_level_values(1).astype(str)
column_separator = ['_' if x != '' else '' for x in level_two]
df.columns = level_one + column_separator + level_two

我知道这与Andy Hayden的出色答案具有相同的作用,但我认为这种方式更直观,并且更容易记住(因此,我不必继续引用此线程),尤其是对于熊猫新手用户。

在您可能具有3个列级别的情况下,此方法也可以扩展。

level_one = df.columns.get_level_values(0).astype(str)
level_two = df.columns.get_level_values(1).astype(str)
level_three = df.columns.get_level_values(2).astype(str)
df.columns = level_one + level_two + level_three

The easiest and most intuitive solution for me was to combine the column names using get_level_values. This prevents duplicate column names when you do more than one aggregation on the same column:

level_one = df.columns.get_level_values(0).astype(str)
level_two = df.columns.get_level_values(1).astype(str)
df.columns = level_one + level_two

If you want a separator between columns, you can do this. This will return the same thing as Seiji Armstrong’s comment on the accepted answer that only includes underscores for columns with values in both index levels:

level_one = df.columns.get_level_values(0).astype(str)
level_two = df.columns.get_level_values(1).astype(str)
column_separator = ['_' if x != '' else '' for x in level_two]
df.columns = level_one + column_separator + level_two

I know this does the same thing as Andy Hayden’s great answer above, but I think it is a bit more intuitive this way and is easier to remember (so I don’t have to keep referring to this thread), especially for novice pandas users.

This method is also more extensible in the case where you may have 3 column levels.

level_one = df.columns.get_level_values(0).astype(str)
level_two = df.columns.get_level_values(1).astype(str)
level_three = df.columns.get_level_values(2).astype(str)
df.columns = level_one + level_two + level_three

回答 8

阅读完所有答案后,我想到了:

def __my_flatten_cols(self, how="_".join, reset_index=True):
    how = (lambda iter: list(iter)[-1]) if how == "last" else how
    self.columns = [how(filter(None, map(str, levels))) for levels in self.columns.values] \
                    if isinstance(self.columns, pd.MultiIndex) else self.columns
    return self.reset_index() if reset_index else self
pd.DataFrame.my_flatten_cols = __my_flatten_cols

用法:

给定一个数据框:

df = pd.DataFrame({"grouper": ["x","x","y","y"], "val1": [0,2,4,6], 2: [1,3,5,7]}, columns=["grouper", "val1", 2])

  grouper  val1  2
0       x     0  1
1       x     2  3
2       y     4  5
3       y     6  7
  • 单一聚合方法与源名称相同的结果变量:

    df.groupby(by="grouper").agg("min").my_flatten_cols()
    • df.groupby(by="grouper", as_index = False).agg(...).reset_index()相同
    • ----- before -----
                 val1  2
        grouper         
      
      ------ after -----
        grouper  val1  2
      0       x     0  1
      1       y     4  5
  • 单源变量,多个聚合以统计信息命名的结果变量:

    df.groupby(by="grouper").agg({"val1": [min,max]}).my_flatten_cols("last")
    • 与相同a = df.groupby(..).agg(..); a.columns = a.columns.droplevel(0); a.reset_index()
    • ----- before -----
                  val1    
                 min max
        grouper         
      
      ------ after -----
        grouper  min  max
      0       x    0    2
      1       y    4    6
  • 多个变量,多个聚合:名为(varname)_(statname)的结果变量:

    df.groupby(by="grouper").agg({"val1": min, 2:[sum, "size"]}).my_flatten_cols()
    # you can combine the names in other ways too, e.g. use a different delimiter:
    #df.groupby(by="grouper").agg({"val1": min, 2:[sum, "size"]}).my_flatten_cols(" ".join)
    • 在后台运行a.columns = ["_".join(filter(None, map(str, levels))) for levels in a.columns.values](因为这种形式的agg()结果出现MultiIndex在列上)。
    • 如果您没有my_flatten_cols帮助者,则输入@Seigi建议的解决方案可能会更容易:a.columns = ["_".join(t).rstrip("_") for t in a.columns.values]在这种情况下,它的工作原理类似(但如果列上有数字标签,则会失败)
    • 要处理列上的数字标签,可以使用@jxstanford和@Nolan Conawaya.columns = ["_".join(tuple(map(str, t))).rstrip("_") for t in a.columns.values])建议的解决方案,但我不明白为什么tuple()需要调用,并且我相信rstrip()只有在某些列具有类似("colname", "")(如果您reset_index()在尝试修复之前会发生这种情况.columns
    • ----- before -----
                 val1           2     
                 min       sum    size
        grouper              
      
      ------ after -----
        grouper  val1_min  2_sum  2_size
      0       x         0      4       2
      1       y         4     12       2
  • 要手动命名结果变量:(这是因为大熊猫0.20.0弃用没有适当的替代性为0.23

    df.groupby(by="grouper").agg({"val1": {"sum_of_val1": "sum", "count_of_val1": "count"},
                                       2: {"sum_of_2":    "sum", "count_of_2":    "count"}}).my_flatten_cols("last")
    • 其他建议包括:手动设置列:res.columns = ['A_sum', 'B_sum', 'count'].join()输入多个groupby语句。
    • ----- before -----
                         val1                      2         
                count_of_val1 sum_of_val1 count_of_2 sum_of_2
        grouper                                              
      
      ------ after -----
        grouper  count_of_val1  sum_of_val1  count_of_2  sum_of_2
      0       x              2            2           2         4
      1       y              2           10           2        12

助手功能处理的案件

  • 级别名称可以是非字符串,例如,当列名称是整数时按列号使用Index pandas DataFrame,因此我们必须使用map(str, ..)
  • 它们也可以是空的,所以我们必须 filter(None, ..)
  • 对于单级列(即,除MultiIndex之外的任何内容),columns.values返回名称(str,而不是元组)
  • 根据您的使用方式,.agg()您可能需要保留一列的最底端标签或连接多个标签
  • (因为我是熊猫新手?),我希望reset_index()能够以常规方式使用group-by列,因此默认情况下会这样做

After reading through all the answers, I came up with this:

def __my_flatten_cols(self, how="_".join, reset_index=True):
    how = (lambda iter: list(iter)[-1]) if how == "last" else how
    self.columns = [how(filter(None, map(str, levels))) for levels in self.columns.values] \
                    if isinstance(self.columns, pd.MultiIndex) else self.columns
    return self.reset_index() if reset_index else self
pd.DataFrame.my_flatten_cols = __my_flatten_cols

Usage:

Given a data frame:

df = pd.DataFrame({"grouper": ["x","x","y","y"], "val1": [0,2,4,6], 2: [1,3,5,7]}, columns=["grouper", "val1", 2])

  grouper  val1  2
0       x     0  1
1       x     2  3
2       y     4  5
3       y     6  7
  • Single aggregation method: resulting variables named the same as source:

    df.groupby(by="grouper").agg("min").my_flatten_cols()
    
    • Same as df.groupby(by="grouper", as_index=False) or .agg(...).reset_index()
    • ----- before -----
                 val1  2
        grouper         
      
      ------ after -----
        grouper  val1  2
      0       x     0  1
      1       y     4  5
      
  • Single source variable, multiple aggregations: resulting variables named after statistics:

    df.groupby(by="grouper").agg({"val1": [min,max]}).my_flatten_cols("last")
    
    • Same as a = df.groupby(..).agg(..); a.columns = a.columns.droplevel(0); a.reset_index().
    • ----- before -----
                  val1    
                 min max
        grouper         
      
      ------ after -----
        grouper  min  max
      0       x    0    2
      1       y    4    6
      
  • Multiple variables, multiple aggregations: resulting variables named (varname)_(statname):

    df.groupby(by="grouper").agg({"val1": min, 2:[sum, "size"]}).my_flatten_cols()
    # you can combine the names in other ways too, e.g. use a different delimiter:
    #df.groupby(by="grouper").agg({"val1": min, 2:[sum, "size"]}).my_flatten_cols(" ".join)
    
    • Runs a.columns = ["_".join(filter(None, map(str, levels))) for levels in a.columns.values] under the hood (since this form of agg() results in MultiIndex on columns).
    • If you don’t have the my_flatten_cols helper, it might be easier to type in the solution suggested by @Seigi: a.columns = ["_".join(t).rstrip("_") for t in a.columns.values], which works similarly in this case (but fails if you have numeric labels on columns)
    • To handle the numeric labels on columns, you could use the solution suggested by @jxstanford and @Nolan Conaway (a.columns = ["_".join(tuple(map(str, t))).rstrip("_") for t in a.columns.values]), but I don’t understand why the tuple() call is needed, and I believe rstrip() is only required if some columns have a descriptor like ("colname", "") (which can happen if you reset_index() before trying to fix up .columns)
    • ----- before -----
                 val1           2     
                 min       sum    size
        grouper              
      
      ------ after -----
        grouper  val1_min  2_sum  2_size
      0       x         0      4       2
      1       y         4     12       2
      
  • You want to name the resulting variables manually: (this is deprecated since pandas 0.20.0 with no adequate alternative as of 0.23)

    df.groupby(by="grouper").agg({"val1": {"sum_of_val1": "sum", "count_of_val1": "count"},
                                       2: {"sum_of_2":    "sum", "count_of_2":    "count"}}).my_flatten_cols("last")
    
    • Other suggestions include: setting the columns manually: res.columns = ['A_sum', 'B_sum', 'count'] or .join()ing multiple groupby statements.
    • ----- before -----
                         val1                      2         
                count_of_val1 sum_of_val1 count_of_2 sum_of_2
        grouper                                              
      
      ------ after -----
        grouper  count_of_val1  sum_of_val1  count_of_2  sum_of_2
      0       x              2            2           2         4
      1       y              2           10           2        12
      

Cases handled by the helper function

  • level names can be non-string, e.g. Index pandas DataFrame by column numbers, when column names are integers, so we have to convert with map(str, ..)
  • they can also be empty, so we have to filter(None, ..)
  • for single-level columns (i.e. anything except MultiIndex), columns.values returns the names (str, not tuples)
  • depending on how you used .agg() you may need to keep the bottom-most label for a column or concatenate multiple labels
  • (since I’m new to pandas?) more often than not, I want reset_index() to be able to work with the group-by columns in the regular way, so it does that by default

回答 9

处理多个级别和混合类型的常规解决方案:

df.columns = ['_'.join(tuple(map(str, t))) for t in df.columns.values]

A general solution that handles multiple levels and mixed types:

df.columns = ['_'.join(tuple(map(str, t))) for t in df.columns.values]

回答 10

也许有些晚,但是如果您不担心重复的列名:

df.columns = df.columns.tolist()

A bit late maybe, but if you are not worried about duplicate column names:

df.columns = df.columns.tolist()

回答 11

如果您希望在各个级别之间使用分隔符,则此功能会很好用。

def flattenHierarchicalCol(col,sep = '_'):
    if not type(col) is tuple:
        return col
    else:
        new_col = ''
        for leveli,level in enumerate(col):
            if not level == '':
                if not leveli == 0:
                    new_col += sep
                new_col += level
        return new_col

df.columns = df.columns.map(flattenHierarchicalCol)

In case you want to have a separator in the name between levels, this function works well.

def flattenHierarchicalCol(col,sep = '_'):
    if not type(col) is tuple:
        return col
    else:
        new_col = ''
        for leveli,level in enumerate(col):
            if not level == '':
                if not leveli == 0:
                    new_col += sep
                new_col += level
        return new_col

df.columns = df.columns.map(flattenHierarchicalCol)

回答 12

在@jxstanford和@ tvt173之后,我编写了一个快速函数,无论字符串/ int列名如何,该函数都可以完成此任务:

def flatten_cols(df):
    df.columns = [
        '_'.join(tuple(map(str, t))).rstrip('_') 
        for t in df.columns.values
        ]
    return df

Following @jxstanford and @tvt173, I wrote a quick function which should do the trick, regardless of string/int column names:

def flatten_cols(df):
    df.columns = [
        '_'.join(tuple(map(str, t))).rstrip('_') 
        for t in df.columns.values
        ]
    return df

回答 13

您也可以按照以下步骤进行操作。考虑df是您的数据框,并假设一个二级索引(在您的示例中就是这种情况)

df.columns = [(df.columns[i][0])+'_'+(datadf_pos4.columns[i][1]) for i in range(len(df.columns))]

You could also do as below. Consider df to be your dataframe and assume a two level index (as is the case in your example)

df.columns = [(df.columns[i][0])+'_'+(datadf_pos4.columns[i][1]) for i in range(len(df.columns))]

回答 14

我将分享一种对我有用的简单方法。

[" ".join([str(elem) for elem in tup]) for tup in df.columns.tolist()]
#df = df.reset_index() if needed

I’ll share a straight-forward way that worked for me.

[" ".join([str(elem) for elem in tup]) for tup in df.columns.tolist()]
#df = df.reset_index() if needed

回答 15

要在其他DataFrame方法链内展平MultiIndex,请定义如下函数:

def flatten_index(df):
  df_copy = df.copy()
  df_copy.columns = ['_'.join(col).rstrip('_') for col in df_copy.columns.values]
  return df_copy.reset_index()

然后使用该pipe方法在DataFrame方法链中,在链中任何其他方法之后groupbyagg之前应用此函数:

my_df \
  .groupby('group') \
  .agg({'value': ['count']}) \
  .pipe(flatten_index) \
  .sort_values('value_count')

To flatten a MultiIndex inside a chain of other DataFrame methods, define a function like this:

def flatten_index(df):
  df_copy = df.copy()
  df_copy.columns = ['_'.join(col).rstrip('_') for col in df_copy.columns.values]
  return df_copy.reset_index()

Then use the pipe method to apply this function in the chain of DataFrame methods, after groupby and agg but before any other methods in the chain:

my_df \
  .groupby('group') \
  .agg({'value': ['count']}) \
  .pipe(flatten_index) \
  .sort_values('value_count')

回答 16

另一个简单的例程。

def flatten_columns(df, sep='.'):
    def _remove_empty(column_name):
        return tuple(element for element in column_name if element)
    def _join(column_name):
        return sep.join(column_name)

    new_columns = [_join(_remove_empty(column)) for column in df.columns.values]
    df.columns = new_columns

Another simple routine.

def flatten_columns(df, sep='.'):
    def _remove_empty(column_name):
        return tuple(element for element in column_name if element)
    def _join(column_name):
        return sep.join(column_name)

    new_columns = [_join(_remove_empty(column)) for column in df.columns.values]
    df.columns = new_columns