问题:为熊猫MultiIndex设置一个级别

经过分组后,我创建了一个具有MultiIndex的DataFrame:

import numpy as np
import pandas as p
from numpy.random import randn

df = p.DataFrame({
    'A' : ['a1', 'a1', 'a2', 'a3']
  , 'B' : ['b1', 'b2', 'b3', 'b4']
  , 'Vals' : randn(4)
}).groupby(['A', 'B']).sum()

df

Output>            Vals
Output> A  B           
Output> a1 b1 -1.632460
Output>    b2  0.596027
Output> a2 b3 -0.619130
Output> a3 b4 -0.002009

如何在MultiIndex前面添加一个级别,以便将其转换为类似以下内容:

Output>                       Vals
Output> FirstLevel A  B           
Output> Foo        a1 b1 -1.632460
Output>               b2  0.596027
Output>            a2 b3 -0.619130
Output>            a3 b4 -0.002009

I have a DataFrame with a MultiIndex created after some grouping:

import numpy as np
import pandas as p
from numpy.random import randn

df = p.DataFrame({
    'A' : ['a1', 'a1', 'a2', 'a3']
  , 'B' : ['b1', 'b2', 'b3', 'b4']
  , 'Vals' : randn(4)
}).groupby(['A', 'B']).sum()

df

Output>            Vals
Output> A  B           
Output> a1 b1 -1.632460
Output>    b2  0.596027
Output> a2 b3 -0.619130
Output> a3 b4 -0.002009

How do I prepend a level to the MultiIndex so that I turn it into something like:

Output>                       Vals
Output> FirstLevel A  B           
Output> Foo        a1 b1 -1.632460
Output>               b2  0.596027
Output>            a2 b3 -0.619130
Output>            a3 b4 -0.002009

回答 0

一种使用pandas.concat()以下代码一行完成此操作的好方法:

import pandas as pd

pd.concat([df], keys=['Foo'], names=['Firstlevel'])

甚至更短的方法:

pd.concat({'Foo': df}, names=['Firstlevel'])

这可以推广到许多数据框架,请参阅docs

A nice way to do this in one line using pandas.concat():

import pandas as pd

pd.concat([df], keys=['Foo'], names=['Firstlevel'])

An even shorter way:

pd.concat({'Foo': df}, names=['Firstlevel'])

This can be generalized to many data frames, see the docs.


回答 1

您可以首先将其添加为普通列,然后将其附加到当前索引,因此:

df['Firstlevel'] = 'Foo'
df.set_index('Firstlevel', append=True, inplace=True)

并根据需要更改顺序:

df.reorder_levels(['Firstlevel', 'A', 'B'])

结果是:

                      Vals
Firstlevel A  B           
Foo        a1 b1  0.871563
              b2  0.494001
           a2 b3 -0.167811
           a3 b4 -1.353409

You can first add it as a normal column and then append it to the current index, so:

df['Firstlevel'] = 'Foo'
df.set_index('Firstlevel', append=True, inplace=True)

And change the order if needed with:

df.reorder_levels(['Firstlevel', 'A', 'B'])

Which results in:

                      Vals
Firstlevel A  B           
Foo        a1 b1  0.871563
              b2  0.494001
           a2 b3 -0.167811
           a3 b4 -1.353409

回答 2

我认为这是一个更通用的解决方案:

# Convert index to dataframe
old_idx = df.index.to_frame()

# Insert new level at specified location
old_idx.insert(0, 'new_level_name', new_level_values)

# Convert back to MultiIndex
df.index = pandas.MultiIndex.from_frame(old_idx)

与其他答案相比有一些优势:

  • 新级别可以添加到任何位置,而不仅仅是顶部。
  • 它纯粹是对索引的一种操作,不需要像串联技巧那样操作数据。
  • 它不需要添加列作为中间步骤,这可以破坏多级列索引。

I think this is a more general solution:

# Convert index to dataframe
old_idx = df.index.to_frame()

# Insert new level at specified location
old_idx.insert(0, 'new_level_name', new_level_values)

# Convert back to MultiIndex
df.index = pandas.MultiIndex.from_frame(old_idx)

Some advantages over the other answers:

  • The new level can be added at any location, not just the top.
  • It is purely a manipulation on the index and doesn’t require manipulating the data, like the concatenation trick.
  • It doesn’t require adding a column as an intermediate step, which can break multi-level column indexes.

回答 3

我用cxrodgers的答案做了一点功能,该IMHO是最佳的解决方案,因为它完全在索引上工作,而与任何数据帧或序列无关。

我添加了一个修复程序: to_frame()方法将为没有索引级别的索引级别发明新名称。这样,新索引将具有旧索引中不存在的名称。我添加了一些代码来还原此名称更改。

下面是代码,我已经使用了一段时间,它似乎工作正常。如果您发现任何问题或极端情况,我将不得不调整答案。

import pandas as pd

def _handle_insert_loc(loc: int, n: int) -> int:
    """
    Computes the insert index from the right if loc is negative for a given size of n.
    """
    return n + loc + 1 if loc < 0 else loc


def add_index_level(old_index: pd.Index, value: Any, name: str = None, loc: int = 0) -> pd.MultiIndex:
    """
    Expand a (multi)index by adding a level to it.

    :param old_index: The index to expand
    :param name: The name of the new index level
    :param value: Scalar or list-like, the values of the new index level
    :param loc: Where to insert the level in the index, 0 is at the front, negative values count back from the rear end
    :return: A new multi-index with the new level added
    """
    loc = _handle_insert_loc(loc, len(old_index.names))
    old_index_df = old_index.to_frame()
    old_index_df.insert(loc, name, value)
    new_index_names = list(old_index.names)  # sometimes new index level names are invented when converting to a df,
    new_index_names.insert(loc, name)        # here the original names are reconstructed
    new_index = pd.MultiIndex.from_frame(old_index_df, names=new_index_names)
    return new_index

它通过了以下单元测试代码:

import unittest

import numpy as np
import pandas as pd

class TestPandaStuff(unittest.TestCase):

    def test_add_index_level(self):
        df = pd.DataFrame(data=np.random.normal(size=(6, 3)))
        i1 = add_index_level(df.index, "foo")

        # it does not invent new index names where there are missing
        self.assertEqual([None, None], i1.names)

        # the new level values are added
        self.assertTrue(np.all(i1.get_level_values(0) == "foo"))
        self.assertTrue(np.all(i1.get_level_values(1) == df.index))

        # it does not invent new index names where there are missing
        i2 = add_index_level(i1, ["x", "y"]*3, name="xy", loc=2)
        i3 = add_index_level(i2, ["a", "b", "c"]*2, name="abc", loc=-1)
        self.assertEqual([None, None, "xy", "abc"], i3.names)

        # the new level values are added
        self.assertTrue(np.all(i3.get_level_values(0) == "foo"))
        self.assertTrue(np.all(i3.get_level_values(1) == df.index))
        self.assertTrue(np.all(i3.get_level_values(2) == ["x", "y"]*3))
        self.assertTrue(np.all(i3.get_level_values(3) == ["a", "b", "c"]*2))

        # df.index = i3
        # print()
        # print(df)

I made a little function out of cxrodgers answer, which IMHO is the best solution since it works purely on an index, independent of any data frame or series.

There is one fix I added: the to_frame() method will invent new names for index levels that don’t have one. As such the new index will have names that don’t exist in the old index. I added some code to revert this name-change.

Below is the code, I’ve used it myself for a while and it seems to work fine. If you find any issues or edge cases, I’d be much obliged to adjust my answer.

import pandas as pd

def _handle_insert_loc(loc: int, n: int) -> int:
    """
    Computes the insert index from the right if loc is negative for a given size of n.
    """
    return n + loc + 1 if loc < 0 else loc


def add_index_level(old_index: pd.Index, value: Any, name: str = None, loc: int = 0) -> pd.MultiIndex:
    """
    Expand a (multi)index by adding a level to it.

    :param old_index: The index to expand
    :param name: The name of the new index level
    :param value: Scalar or list-like, the values of the new index level
    :param loc: Where to insert the level in the index, 0 is at the front, negative values count back from the rear end
    :return: A new multi-index with the new level added
    """
    loc = _handle_insert_loc(loc, len(old_index.names))
    old_index_df = old_index.to_frame()
    old_index_df.insert(loc, name, value)
    new_index_names = list(old_index.names)  # sometimes new index level names are invented when converting to a df,
    new_index_names.insert(loc, name)        # here the original names are reconstructed
    new_index = pd.MultiIndex.from_frame(old_index_df, names=new_index_names)
    return new_index

It passed the following unittest code:

import unittest

import numpy as np
import pandas as pd

class TestPandaStuff(unittest.TestCase):

    def test_add_index_level(self):
        df = pd.DataFrame(data=np.random.normal(size=(6, 3)))
        i1 = add_index_level(df.index, "foo")

        # it does not invent new index names where there are missing
        self.assertEqual([None, None], i1.names)

        # the new level values are added
        self.assertTrue(np.all(i1.get_level_values(0) == "foo"))
        self.assertTrue(np.all(i1.get_level_values(1) == df.index))

        # it does not invent new index names where there are missing
        i2 = add_index_level(i1, ["x", "y"]*3, name="xy", loc=2)
        i3 = add_index_level(i2, ["a", "b", "c"]*2, name="abc", loc=-1)
        self.assertEqual([None, None, "xy", "abc"], i3.names)

        # the new level values are added
        self.assertTrue(np.all(i3.get_level_values(0) == "foo"))
        self.assertTrue(np.all(i3.get_level_values(1) == df.index))
        self.assertTrue(np.all(i3.get_level_values(2) == ["x", "y"]*3))
        self.assertTrue(np.all(i3.get_level_values(3) == ["a", "b", "c"]*2))

        # df.index = i3
        # print()
        # print(df)

回答 4

如何使用pandas.MultiIndex.from_tuples从头开始构建它?

df.index = p.MultiIndex.from_tuples(
    [(nl, A, B) for nl, (A, B) in
        zip(['Foo'] * len(df), df.index)],
    names=['FirstLevel', 'A', 'B'])

cxrodger的解决方案类似,这是一种灵活的方法,避免修改数据框的基础数组。

How about building it from scratch with pandas.MultiIndex.from_tuples?

df.index = p.MultiIndex.from_tuples(
    [(nl, A, B) for nl, (A, B) in
        zip(['Foo'] * len(df), df.index)],
    names=['FirstLevel', 'A', 'B'])

Similarly to cxrodger’s solution, this is a flexible method and avoids modifying the underlying array for the dataframe.


声明:本站所有文章,如无特殊说明或标注,均为本站原创发布。任何个人或组织,在未征得本站同意时,禁止复制、盗用、采集、发布本站内容到任何网站、书籍等各类媒体平台。如若本站内容侵犯了原著者的合法权益,可联系我们进行处理。