问题:在熊猫MultiIndex DataFrame中选择行

选择/过滤索引为MultiIndex数据框的行的最常见的熊猫方法是什么

  • 根据单个值/标签切片
  • 根据一个或多个级别的多个标签进行切片
  • 过滤布尔条件和表达式
  • 哪种方法在什么情况下适用

为简单起见的假设:

  1. 输入数据框没有重复的索引键
  2. 下面的输入数据框只有两个级别。(此处显示的大多数解决方案一般都适用于N级)

输入示例:

mux = pd.MultiIndex.from_arrays([
    list('aaaabbbbbccddddd'),
    list('tuvwtuvwtuvwtuvw')
], names=['one', 'two'])

df = pd.DataFrame({'col': np.arange(len(mux))}, mux)

         col
one two     
a   t      0
    u      1
    v      2
    w      3
b   t      4
    u      5
    v      6
    w      7
    t      8
c   u      9
    v     10
d   w     11
    t     12
    u     13
    v     14
    w     15

问题1:选择单个项目

如何选择在“一个”级别中具有“ a”的行?

         col
one two     
a   t      0
    u      1
    v      2
    w      3

另外,我如何在输出中将级别“一”降低?

     col
two     
t      0
u      1
v      2
w      3

问题1b
如何在级别“ two”上切片所有值为“ t”的行?

         col
one two     
a   t      0
b   t      4
    t      8
d   t     12

问题2:在一个级别中选择多个值

如何在级别“ one”中选择与项目“ b”和“ d”相对应的行?

         col
one two     
b   t      4
    u      5
    v      6
    w      7
    t      8
d   w     11
    t     12
    u     13
    v     14
    w     15

问题2b
我如何获得与“二”级中的“ t”和“ w”相对应的所有值?

         col
one two     
a   t      0
    w      3
b   t      4
    w      7
    t      8
d   w     11
    t     12
    w     15

问题3:切片单个横截面 (x, y)

如何检索横截面,即具有从中为索引指定值的单行df?具体来说,我如何检索的横截面('c', 'u'),由

         col
one two     
c   u      9

问题4:切片多个横截面 [(a, b), (c, d), ...]

如何选择与('c', 'u')和相对应的两行('a', 'w')

         col
one two     
c   u      9
a   w      3

问题5:每个级别切成一个项目

如何检索与“一级”中的“ a”或“二级”中的“ t”相对应的所有行?

         col
one two     
a   t      0
    u      1
    v      2
    w      3
b   t      4
    t      8
d   t     12

问题6:任意切片

如何切特定的横截面?对于“ a”和“ b”,我想选择子级别为“ u”和“ v”的所有行,对于“ d”,我想选择子级别为“ w”的行。

         col
one two     
a   u      1
    v      2
b   u      5
    v      6
d   w     11
    w     15

问题7将使用由数字级别组成的唯一设置:

np.random.seed(0)
mux2 = pd.MultiIndex.from_arrays([
    list('aaaabbbbbccddddd'),
    np.random.choice(10, size=16)
], names=['one', 'two'])

df2 = pd.DataFrame({'col': np.arange(len(mux2))}, mux2)

         col
one two     
a   5      0
    0      1
    3      2
    3      3
b   7      4
    9      5
    3      6
    5      7
    2      8
c   4      9
    7     10
d   6     11
    8     12
    8     13
    1     14
    6     15

问题7:按数字不等式对多索引的各个级别进行过滤

如何获得“二级”中的值大于5的所有行?

         col
one two     
b   7      4
    9      5
c   7     10
d   6     11
    8     12
    8     13
    6     15

注意:本文将介绍如何创建MultiIndexes,如何对其执行赋值操作或任何与性能相关的讨论(这些是下次的单独主题)。

What are the most common pandas ways to select/filter rows of a dataframe whose index is a MultiIndex?

  • Slicing based on a single value/label
  • Slicing based on multiple labels from one or more levels
  • Filtering on boolean conditions and expressions
  • Which methods are applicable in what circumstances

Assumptions for simplicity:

  1. input dataframe does not have duplicate index keys
  2. input dataframe below only has two levels. (Most solutions shown here generalize to N levels)

Example input:

mux = pd.MultiIndex.from_arrays([
    list('aaaabbbbbccddddd'),
    list('tuvwtuvwtuvwtuvw')
], names=['one', 'two'])

df = pd.DataFrame({'col': np.arange(len(mux))}, mux)

         col
one two     
a   t      0
    u      1
    v      2
    w      3
b   t      4
    u      5
    v      6
    w      7
    t      8
c   u      9
    v     10
d   w     11
    t     12
    u     13
    v     14
    w     15

Question 1: Selecting a Single Item

How do I select rows having “a” in level “one”?

         col
one two     
a   t      0
    u      1
    v      2
    w      3

Additionally, how would I be able to drop level “one” in the output?

     col
two     
t      0
u      1
v      2
w      3

Question 1b
How do I slice all rows with value “t” on level “two”?

         col
one two     
a   t      0
b   t      4
    t      8
d   t     12

Question 2: Selecting Multiple Values in a Level

How can I select rows corresponding to items “b” and “d” in level “one”?

         col
one two     
b   t      4
    u      5
    v      6
    w      7
    t      8
d   w     11
    t     12
    u     13
    v     14
    w     15

Question 2b
How would I get all values corresponding to “t” and “w” in level “two”?

         col
one two     
a   t      0
    w      3
b   t      4
    w      7
    t      8
d   w     11
    t     12
    w     15

Question 3: Slicing a Single Cross Section (x, y)

How do I retrieve a cross section, i.e., a single row having a specific values for the index from df? Specifically, how do I retrieve the cross section of ('c', 'u'), given by

         col
one two     
c   u      9

Question 4: Slicing Multiple Cross Sections [(a, b), (c, d), ...]

How do I select the two rows corresponding to ('c', 'u'), and ('a', 'w')?

         col
one two     
c   u      9
a   w      3

Question 5: One Item Sliced per Level

How can I retrieve all rows corresponding to “a” in level “one” or “t” in level “two”?

         col
one two     
a   t      0
    u      1
    v      2
    w      3
b   t      4
    t      8
d   t     12

Question 6: Arbitrary Slicing

How can I slice specific cross sections? For “a” and “b”, I would like to select all rows with sub-levels “u” and “v”, and for “d”, I would like to select rows with sub-level “w”.

         col
one two     
a   u      1
    v      2
b   u      5
    v      6
d   w     11
    w     15

Question 7 will use a unique setup consisting of a numeric level:

np.random.seed(0)
mux2 = pd.MultiIndex.from_arrays([
    list('aaaabbbbbccddddd'),
    np.random.choice(10, size=16)
], names=['one', 'two'])

df2 = pd.DataFrame({'col': np.arange(len(mux2))}, mux2)

         col
one two     
a   5      0
    0      1
    3      2
    3      3
b   7      4
    9      5
    3      6
    5      7
    2      8
c   4      9
    7     10
d   6     11
    8     12
    8     13
    1     14
    6     15

Question 7: Filtering by numeric inequality on individual levels of the multiindex

How do I get all rows where values in level “two” are greater than 5?

         col
one two     
b   7      4
    9      5
c   7     10
d   6     11
    8     12
    8     13
    6     15

Note: This post will not go through how to create MultiIndexes, how to perform assignment operations on them, or any performance related discussions (these are separate topics for another time).


回答 0

多索引/高级索引

注意:
此帖子的结构如下:

  1. 操作规范中提出的问题将一一解决
  2. 对于每个问题,将演示一种或多种适用于解决该问题并获得预期结果的方法。

注意 s(非常类似于此内容)将被提供给有兴趣学习更多功能,实现细节和其他手头主题信息的读者。这些注释是通过搜集文档并发现各种晦涩难懂的功能,以及从我自己的(公认的有限)经验中编写的。

所有代码示例均已在pandas v0.23.4,python3.7上创建并测试。如果有不清楚的地方,或者实际上是不正确的,或者您找不到适用于您的用例的解决方案,请随时提出修改建议,在注释中要求澄清,或打开一个新问题…… 。

这是对一些常见习语(以下称为“四个习语”)的介绍,我们将经常对其进行复习。

  1. DataFrame.loc-按标签选择的一般解决方案(+ pd.IndexSlice表示涉及切片的更复杂应用)

  2. DataFrame.xs -从Series / DataFrame中提取特定的横截面。

  3. DataFrame.query-动态指定切片和/或过滤操作(即,作为动态评估的表达式。比其他情况更适用于某些方案。另请参阅文档的此部分以查询MultiIndexes。

  4. 使用生成的掩码进行布尔索引MultiIndex.get_level_values(通常与结合使用Index.isin,尤其是在使用多个值进行过滤时)。在某些情况下这也很有用。

从四个习语的角度来看各种切片和过滤问题,将有助于更好地理解可以应用于给定情况的内容,这将是有益的。非常重要的一点是要了解,并非所有习惯用法在每种情况下都一样有效(如果有的话)。如果没有将成语列为以下问题的潜在解决方案,则意味着该成语不能有效地应用于该问题。


问题1

如何选择在“一个”级别中具有“ a”的行?

         col
one two     
a   t      0
    u      1
    v      2
    w      3

您可以将loc用作适用于大多数情况的通用解决方案:

df.loc[['a']]

此时,如果您得到

TypeError: Expected tuple, got str

这意味着您使用的是旧版熊猫。考虑升级!否则,请使用df.loc[('a', slice(None)), :]

或者,您可以xs在这里使用,因为我们要提取单个横截面。请注意levelsaxis参数(此处可以采用合理的默认值)。

df.xs('a', level=0, axis=0, drop_level=False)
# df.xs('a', drop_level=False)

在这里,drop_level=False需要使用参数来防止xs在结果(我们切入的水平)上降低“一级”。

这里的另一个选择是使用query

df.query("one == 'a'")

如果索引没有名称,则需要将查询字符串更改为"ilevel_0 == 'a'"

最后,使用get_level_values

df[df.index.get_level_values('one') == 'a']
# If your levels are unnamed, or if you need to select by position (not label),
# df[df.index.get_level_values(0) == 'a']

另外,我如何在输出中将级别“一”降低?

     col
two     
t      0
u      1
v      2
w      3

使用以下任一方法均可轻松完成此操作

df.loc['a'] # Notice the single string argument instead the list.

要么,

df.xs('a', level=0, axis=0, drop_level=True)
# df.xs('a')

注意,我们可以省略该drop_level参数(True默认情况下假定为该参数)。

注意
您可能会注意到,经过过滤的DataFrame可能仍然具有所有级别,即使在打印出DataFrame时不显示它们也是如此。例如,

v = df.loc[['a']]
print(v)
         col
one two     
a   t      0
    u      1
    v      2
    w      3

print(v.index)
MultiIndex(levels=[['a', 'b', 'c', 'd'], ['t', 'u', 'v', 'w']],
           labels=[[0, 0, 0, 0], [0, 1, 2, 3]],
           names=['one', 'two'])

您可以使用以下方法消除这些级别MultiIndex.remove_unused_levels

v.index = v.index.remove_unused_levels()

print(v.index)
MultiIndex(levels=[['a'], ['t', 'u', 'v', 'w']],
           labels=[[0, 0, 0, 0], [0, 1, 2, 3]],
           names=['one', 'two'])

问题1b

如何在级别“ two”上切片所有值为“ t”的行?

         col
one two     
a   t      0
b   t      4
    t      8
d   t     12

直观地讲,您需要包含以下内容slice()

df.loc[(slice(None), 't'), :]

它就是行得通!™,但是笨拙。我们可以在pd.IndexSlice此处使用API 促进更自然的切片语法。

idx = pd.IndexSlice
df.loc[idx[:, 't'], :]

这要干净得多。

注意
为什么:需要跨列的尾随切片?这是因为loc可以用于沿两个轴(axis=0axis=1)进行选择和切片。没有明确说明要在哪个轴上进行切片,操作将变得模棱两可。请参阅切片文档中的红色大框。

如果要消除任何歧义,请loc接受一个axis 参数:

df.loc(axis=0)[pd.IndexSlice[:, 't']]

如果没有该axis参数(即仅执行df.loc[pd.IndexSlice[:, 't']]),则假定切片在列上,并且KeyError在这种情况下将引发a 。

切片器中对此进行了记录。但是,出于本文的目的,我们将明确指定所有轴。

xs,它是

df.xs('t', axis=0, level=1, drop_level=False)

query,它是

df.query("two == 't'")
# Or, if the first level has no name, 
# df.query("ilevel_1 == 't'") 

最后,通过get_level_values,您可以

df[df.index.get_level_values('two') == 't']
# Or, to perform selection by position/integer,
# df[df.index.get_level_values(1) == 't']

全部达到相同的效果。


问题2

如何在级别“ one”中选择与项目“ b”和“ d”相对应的行?

         col
one two     
b   t      4
    u      5
    v      6
    w      7
    t      8
d   w     11
    t     12
    u     13
    v     14
    w     15

使用loc,可以通过指定列表以类似方式完成此操作。

df.loc[['b', 'd']]

要解决选择“ b”和“ d”的上述问题,您还可以使用query

items = ['b', 'd']
df.query("one in @items")
# df.query("one == @items", parser='pandas')
# df.query("one in ['b', 'd']")
# df.query("one == ['b', 'd']", parser='pandas')

注意:
是的,默认解析器为'pandas',但重要的是要突出此语法不是python。Pandas解析器从表达式生成的解析树略有不同。这样做是为了使某些操作更直观地指定。有关更多信息,请阅读我关于 使用pd.eval()在熊猫中进行动态表达评估的文章

并且,用get_level_values+ Index.isin

df[df.index.get_level_values("one").isin(['b', 'd'])]

问题2b

我如何获得与“第二”级别中的“ t”和“ w”相对应的所有值?

         col
one two     
a   t      0
    w      3
b   t      4
    w      7
    t      8
d   w     11
    t     12
    w     15

使用loc只有与结合使用才有可能pd.IndexSlice

df.loc[pd.IndexSlice[:, ['t', 'w']], :] 

中的第一个冒号:pd.IndexSlice[:, ['t', 'w']]指跨越第一级。随着要查询的级别的深度增加,您将需要指定更多的切片,每个级别将切片一个。您不需要指定多层次超越但被切片的一个。

使用query,这是

items = ['t', 'w']
df.query("two in @items")
# df.query("two == @items", parser='pandas') 
# df.query("two in ['t', 'w']")
# df.query("two == ['t', 'w']", parser='pandas')

get_level_valuesIndex.isin(类似于上面):

df[df.index.get_level_values('two').isin(['t', 'w'])]

问题3

如何检索横截面,即具有从中为索引指定值的单行df?具体来说,我如何检索的横截面('c', 'u'),由

         col
one two     
c   u      9

loc通过指定键元组来使用:

df.loc[('c', 'u'), :]

要么,

df.loc[pd.IndexSlice[('c', 'u')]]

注意
此时,您可能会遇到PerformanceWarning如下所示的:

PerformanceWarning: indexing past lexsort depth may impact performance.

这仅表示您的索引未排序。大熊猫取决于要进行最佳搜索和检索的索引(在这种情况下,按字典顺序排序,因为我们正在处理字符串值)。一种快速的解决方法是使用预先对DataFrame进行排序DataFrame.sort_index。如果您计划一并执行多个此类查询,那么从性能角度来看,这是特别理想的:

df_sort = df.sort_index()
df_sort.loc[('c', 'u')]

您还可以用来检查索引是否已排序。此函数返回TrueFalse相应地。您可以调用此函数来确定是否需要其他排序步骤。

使用xs,这再次简单地传递了一个元组作为第一个参数,而所有其他参数都设置为其适当的默认值:

df.xs(('c', 'u'))

使用query,事情变得有些笨拙:

df.query("one == 'c' and two == 'u'")

您现在可以看到,这将很难一概而论。但是对于这个特定问题仍然可以。

跨多个级别的访问get_level_values仍然可以使用,但不建议这样做:

m1 = (df.index.get_level_values('one') == 'c')
m2 = (df.index.get_level_values('two') == 'u')
df[m1 & m2]

问题4

如何选择与('c', 'u')和相对应的两行('a', 'w')

         col
one two     
c   u      9
a   w      3

使用loc,这仍然很简单:

df.loc[[('c', 'u'), ('a', 'w')]]
# df.loc[pd.IndexSlice[[('c', 'u'), ('a', 'w')]]]

使用query,您将需要通过遍历横截面和层次来动态生成查询字符串:

cses = [('c', 'u'), ('a', 'w')]
levels = ['one', 'two']
# This is a useful check to make in advance.
assert all(len(levels) == len(cs) for cs in cses) 

query = '(' + ') or ('.join([
    ' and '.join([f"({l} == {repr(c)})" for l, c in zip(levels, cs)]) 
    for cs in cses
]) + ')'

print(query)
# ((one == 'c') and (two == 'u')) or ((one == 'a') and (two == 'w'))

df.query(query)

100%不推荐!但是有可能。


问题5

如何检索与“一级”中的“ a”或“二级”中的“ t”相对应的所有行?

         col
one two     
a   t      0
    u      1
    v      2
    w      3
b   t      4
    t      8
d   t     12

loc在确保正确性仍保持代码清晰的同时,这实际上很难做到。df.loc[pd.IndexSlice['a', 't']]不正确,将其解释为df.loc[pd.IndexSlice[('a', 't')]](即选择横截面)。您可能会想到一种解决方案,pd.concat可以单独处理每个标签:

pd.concat([
    df.loc[['a'],:], df.loc[pd.IndexSlice[:, 't'],:]
])

         col
one two     
a   t      0
    u      1
    v      2
    w      3
    t      0   # Does this look right to you? No, it isn't!
b   t      4
    t      8
d   t     12

但是您会注意到其中一行是重复的。这是因为该行同时满足两个切片条件,因此出现了两次。您将需要做

v = pd.concat([
        df.loc[['a'],:], df.loc[pd.IndexSlice[:, 't'],:]
])
v[~v.index.duplicated()]

但是,如果您的DataFrame固有地包含重复的索引(所需),那么它将不会保留它们。 使用时要格外小心

使用query,这非常简单:

df.query("one == 'a' or two == 't'")

使用get_level_values,这仍然很简单,但并不那么优雅:

m1 = (df.index.get_level_values('one') == 'a')
m2 = (df.index.get_level_values('two') == 't')
df[m1 | m2] 

问题6

如何切特定的横截面?对于“ a”和“ b”,我想选择子级别为“ u”和“ v”的所有行,对于“ d”,我想选择子级别为“ w”的行。

         col
one two     
a   u      1
    v      2
b   u      5
    v      6
d   w     11
    w     15

我添加了这是一个特殊情况,以帮助理解四个惯用语的用法-这是其中的任何一个都无法有效工作的情况,因为切片非常具体,并且没有遵循任何实际模式。

通常,像这样的切片问题将需要将键列表显式传递给loc。一种方法是:

keys = [('a', 'u'), ('a', 'v'), ('b', 'u'), ('b', 'v'), ('d', 'w')]
df.loc[keys, :]

如果要保存一些键入内容,您将认识到存在一种切片“ a”,“ b”及其子级别的模式,因此我们可以将切片任务分为两部分和concat结果:

pd.concat([
     df.loc[(('a', 'b'), ('u', 'v')), :], 
     df.loc[('d', 'w'), :]
   ], axis=0)

“ a”和“ b”的切片规范稍微清晰一些,(('a', 'b'), ('u', 'v'))因为被索引的相同子级别对于每个级别都是相同的。


问题7

如何获得“二级”中的值大于5的所有行?

         col
one two     
b   7      4
    9      5
c   7     10
d   6     11
    8     12
    8     13
    6     15

这是可以做到用query

df2.query("two > 5")

get_level_values

df2[df2.index.get_level_values('two') > 5]

注意
类似于此示例,我们可以使用这些构造基于任意条件进行过滤。在一般情况下,要记住,这是非常有用的loc,并xs是专为基于标签的索引,而queryget_level_values是构建一般条件口罩用于过滤很有帮助。


奖金问题

如果我需要对MultiIndex 进行切片怎么办?

实际上,此处的大多数解决方案也适用于色谱柱,只需稍作更改即可。考虑:

np.random.seed(0)
mux3 = pd.MultiIndex.from_product([
        list('ABCD'), list('efgh')
], names=['one','two'])

df3 = pd.DataFrame(np.random.choice(10, (3, len(mux))), columns=mux3)
print(df3)

one  A           B           C           D         
two  e  f  g  h  e  f  g  h  e  f  g  h  e  f  g  h
0    5  0  3  3  7  9  3  5  2  4  7  6  8  8  1  6
1    7  7  8  1  5  9  8  9  4  3  0  3  5  0  2  3
2    8  1  3  3  3  7  0  1  9  9  0  4  7  3  2  7

这些是您需要对四个习惯用法进行的以下更改,才能使它们与列一起使用。

  1. 要切片loc,请使用

    df3.loc[:, ....] # Notice how we slice across the index with `:`. 

    要么,

    df3.loc[:, pd.IndexSlice[...]]
  2. xs适当使用,只需传递一个参数axis=1

  3. 您可以使用直接访问列级别值df.columns.get_level_values。然后,您需要做类似的事情

    df.loc[:, {condition}] 

    其中{condition}代表使用建立的某些条件columns.get_level_values

  4. 要使用query,您唯一的选择是转置,查询索引并再次转置:

    df3.T.query(...).T

    不建议使用其他3个选项之一。

MultiIndex / Advanced Indexing

Note
This post will be structured in the following manner:

  1. The questions put forth in the OP will be addressed, one by one
  2. For each question, one or more methods applicable to solving this problem and getting the expected result will be demonstrated.

Notes (much like this one) will be included for readers interested in learning about additional functionality, implementation details, and other info cursory to the topic at hand. These notes have been compiled through scouring the docs and uncovering various obscure features, and from my own (admittedly limited) experience.

All code samples have created and tested on pandas v0.23.4, python3.7. If something is not clear, or factually incorrect, or if you did not find a solution applicable to your use case, please feel free to suggest an edit, request clarification in the comments, or open a new question, ….as applicable.

Here is an introduction to some common idioms (henceforth referred to as the Four Idioms) we will be frequently re-visiting

  1. DataFrame.loc – A general solution for selection by label (+ pd.IndexSlice for more complex applications involving slices)

  2. DataFrame.xs – Extract a particular cross section from a Series/DataFrame.

  3. DataFrame.query – Specify slicing and/or filtering operations dynamically (i.e., as an expression that is evaluated dynamically. Is more applicable to some scenarios than others. Also see this section of the docs for querying on MultiIndexes.

  4. Boolean indexing with a mask generated using MultiIndex.get_level_values (often in conjunction with Index.isin, especially when filtering with multiple values). This is also quite useful in some circumstances.

It will be beneficial to look at the various slicing and filtering problems in terms of the Four Idioms to gain a better understanding what can be applied to a given situation. It is very important to understand that not all of the idioms will work equally well (if at all) in every circumstance. If an idiom has not been listed as a potential solution to a problem below, that means that idiom cannot be applied to that problem effectively.


Question 1

How do I select rows having “a” in level “one”?

         col
one two     
a   t      0
    u      1
    v      2
    w      3

You can use loc, as a general purpose solution applicable to most situations:

df.loc[['a']]

At this point, if you get

TypeError: Expected tuple, got str

That means you’re using an older version of pandas. Consider upgrading! Otherwise, use df.loc[('a', slice(None)), :].

Alternatively, you can use xs here, since we are extracting a single cross section. Note the levels and axis arguments (reasonable defaults can be assumed here).

df.xs('a', level=0, axis=0, drop_level=False)
# df.xs('a', drop_level=False)

Here, the drop_level=False argument is needed to prevent xs from dropping level “one” in the result (the level we sliced on).

Yet another option here is using query:

df.query("one == 'a'")

If the index did not have a name, you would need to change your query string to be "ilevel_0 == 'a'".

Finally, using get_level_values:

df[df.index.get_level_values('one') == 'a']
# If your levels are unnamed, or if you need to select by position (not label),
# df[df.index.get_level_values(0) == 'a']

Additionally, how would I be able to drop level “one” in the output?

     col
two     
t      0
u      1
v      2
w      3

This can be easily done using either

df.loc['a'] # Notice the single string argument instead the list.

Or,

df.xs('a', level=0, axis=0, drop_level=True)
# df.xs('a')

Notice that we can omit the drop_level argument (it is assumed to be True by default).

Note
You may notice that a filtered DataFrame may still have all the levels, even if they do not show when printing the DataFrame out. For example,

v = df.loc[['a']]
print(v)
         col
one two     
a   t      0
    u      1
    v      2
    w      3

print(v.index)
MultiIndex(levels=[['a', 'b', 'c', 'd'], ['t', 'u', 'v', 'w']],
           labels=[[0, 0, 0, 0], [0, 1, 2, 3]],
           names=['one', 'two'])

You can get rid of these levels using MultiIndex.remove_unused_levels:

v.index = v.index.remove_unused_levels()
print(v.index)
MultiIndex(levels=[['a'], ['t', 'u', 'v', 'w']],
           labels=[[0, 0, 0, 0], [0, 1, 2, 3]],
           names=['one', 'two'])

Question 1b

How do I slice all rows with value “t” on level “two”?

         col
one two     
a   t      0
b   t      4
    t      8
d   t     12

Intuitively, you would want something involving slice():

df.loc[(slice(None), 't'), :]

It Just Works!™ But it is clunky. We can facilitate a more natural slicing syntax using the pd.IndexSlice API here.

idx = pd.IndexSlice
df.loc[idx[:, 't'], :]

This is much, much cleaner.

Note
Why is the trailing slice : across the columns required? This is because, loc can be used to select and slice along both axes (axis=0 or axis=1). Without explicitly making it clear which axis the slicing is to be done on, the operation becomes ambiguous. See the big red box in the documentation on slicing.

If you want to remove any shade of ambiguity, loc accepts an axis parameter:

df.loc(axis=0)[pd.IndexSlice[:, 't']]

Without the axis parameter (i.e., just by doing df.loc[pd.IndexSlice[:, 't']]), slicing is assumed to be on the columns, and a KeyError will be raised in this circumstance.

This is documented in slicers. For the purpose of this post, however, we will explicitly specify all axes.

With xs, it is

df.xs('t', axis=0, level=1, drop_level=False)

With query, it is

df.query("two == 't'")
# Or, if the first level has no name, 
# df.query("ilevel_1 == 't'") 

And finally, with get_level_values, you may do

df[df.index.get_level_values('two') == 't']
# Or, to perform selection by position/integer,
# df[df.index.get_level_values(1) == 't']

All to the same effect.


Question 2

How can I select rows corresponding to items “b” and “d” in level “one”?

         col
one two     
b   t      4
    u      5
    v      6
    w      7
    t      8
d   w     11
    t     12
    u     13
    v     14
    w     15

Using loc, this is done in a similar fashion by specifying a list.

df.loc[['b', 'd']]

To solve the above problem of selecting “b” and “d”, you can also use query:

items = ['b', 'd']
df.query("one in @items")
# df.query("one == @items", parser='pandas')
# df.query("one in ['b', 'd']")
# df.query("one == ['b', 'd']", parser='pandas')

Note
Yes, the default parser is 'pandas', but it is important to highlight this syntax isn’t conventionally python. The Pandas parser generates a slightly different parse tree from the expression. This is done to make some operations more intuitive to specify. For more information, please read my post on Dynamic Expression Evaluation in pandas using pd.eval().

And, with get_level_values + Index.isin:

df[df.index.get_level_values("one").isin(['b', 'd'])]

Question 2b

How would I get all values corresponding to “t” and “w” in level “two”?

         col
one two     
a   t      0
    w      3
b   t      4
    w      7
    t      8
d   w     11
    t     12
    w     15

With loc, this is possible only in conjuction with pd.IndexSlice.

df.loc[pd.IndexSlice[:, ['t', 'w']], :] 

The first colon : in pd.IndexSlice[:, ['t', 'w']] means to slice across the first level. As the depth of the level being queried increases, you will need to specify more slices, one per level being sliced across. You will not need to specify more levels beyond the one being sliced, however.

With query, this is

items = ['t', 'w']
df.query("two in @items")
# df.query("two == @items", parser='pandas') 
# df.query("two in ['t', 'w']")
# df.query("two == ['t', 'w']", parser='pandas')

With get_level_values and Index.isin (similar to above):

df[df.index.get_level_values('two').isin(['t', 'w'])]

Question 3

How do I retrieve a cross section, i.e., a single row having a specific values for the index from df? Specifically, how do I retrieve the cross section of ('c', 'u'), given by

         col
one two     
c   u      9

Use loc by specifying a tuple of keys:

df.loc[('c', 'u'), :]

Or,

df.loc[pd.IndexSlice[('c', 'u')]]

Note
At this point, you may run into a PerformanceWarning that looks like this:

PerformanceWarning: indexing past lexsort depth may impact performance.

This just means that your index is not sorted. pandas depends on the index being sorted (in this case, lexicographically, since we are dealing with string values) for optimal search and retrieval. A quick fix would be to sort your DataFrame in advance using DataFrame.sort_index. This is especially desirable from a performance standpoint if you plan on doing multiple such queries in tandem:

df_sort = df.sort_index()
df_sort.loc[('c', 'u')]

You can also use to check whether the index is sorted or not. This function returns True or False accordingly. You can call this function to determine whether an additional sorting step is required or not.

With xs, this is again simply passing a single tuple as the first argument, with all other arguments set to their appropriate defaults:

df.xs(('c', 'u'))

With query, things become a bit clunky:

df.query("one == 'c' and two == 'u'")

You can see now that this is going to be relatively difficult to generalize. But is still OK for this particular problem.

With accesses spanning multiple levels, get_level_values can still be used, but is not recommended:

m1 = (df.index.get_level_values('one') == 'c')
m2 = (df.index.get_level_values('two') == 'u')
df[m1 & m2]

Question 4

How do I select the two rows corresponding to ('c', 'u'), and ('a', 'w')?

         col
one two     
c   u      9
a   w      3

With loc, this is still as simple as:

df.loc[[('c', 'u'), ('a', 'w')]]
# df.loc[pd.IndexSlice[[('c', 'u'), ('a', 'w')]]]

With query, you will need to dynamically generate a query string by iterating over your cross sections and levels:

cses = [('c', 'u'), ('a', 'w')]
levels = ['one', 'two']
# This is a useful check to make in advance.
assert all(len(levels) == len(cs) for cs in cses) 

query = '(' + ') or ('.join([
    ' and '.join([f"({l} == {repr(c)})" for l, c in zip(levels, cs)]) 
    for cs in cses
]) + ')'

print(query)
# ((one == 'c') and (two == 'u')) or ((one == 'a') and (two == 'w'))

df.query(query)

100% DO NOT RECOMMEND! But it is possible.

What if I have multiple levels?
One option in this scenario would be to use droplevel to drop the levels you’re not checking, then use isin to test membership, and then boolean index on the final result.

df[df.index.droplevel(unused_level).isin([('c', 'u'), ('a', 'w')])]

Question 5

How can I retrieve all rows corresponding to “a” in level “one” or “t” in level “two”?

         col
one two     
a   t      0
    u      1
    v      2
    w      3
b   t      4
    t      8
d   t     12

This is actually very difficult to do with loc while ensuring correctness and still maintaining code clarity. df.loc[pd.IndexSlice['a', 't']] is incorrect, it is interpreted as df.loc[pd.IndexSlice[('a', 't')]] (i.e., selecting a cross section). You may think of a solution with pd.concat to handle each label separately:

pd.concat([
    df.loc[['a'],:], df.loc[pd.IndexSlice[:, 't'],:]
])

         col
one two     
a   t      0
    u      1
    v      2
    w      3
    t      0   # Does this look right to you? No, it isn't!
b   t      4
    t      8
d   t     12

But you’ll notice one of the rows is duplicated. This is because that row satisfied both slicing conditions, and so appeared twice. You will instead need to do

v = pd.concat([
        df.loc[['a'],:], df.loc[pd.IndexSlice[:, 't'],:]
])
v[~v.index.duplicated()]

But if your DataFrame inherently contains duplicate indices (that you want), then this will not retain them. Use with extreme caution.

With query, this is stupidly simple:

df.query("one == 'a' or two == 't'")

With get_level_values, this is still simple, but not as elegant:

m1 = (df.index.get_level_values('one') == 'a')
m2 = (df.index.get_level_values('two') == 't')
df[m1 | m2] 

Question 6

How can I slice specific cross sections? For “a” and “b”, I would like to select all rows with sub-levels “u” and “v”, and for “d”, I would like to select rows with sub-level “w”.

         col
one two     
a   u      1
    v      2
b   u      5
    v      6
d   w     11
    w     15

This is a special case that I’ve added to help understand the applicability of the Four Idioms—this is one case where none of them will work effectively, since the slicing is very specific, and does not follow any real pattern.

Usually, slicing problems like this will require explicitly passing a list of keys to loc. One way of doing this is with:

keys = [('a', 'u'), ('a', 'v'), ('b', 'u'), ('b', 'v'), ('d', 'w')]
df.loc[keys, :]

If you want to save some typing, you will recognise that there is a pattern to slicing “a”, “b” and its sublevels, so we can separate the slicing task into two portions and concat the result:

pd.concat([
     df.loc[(('a', 'b'), ('u', 'v')), :], 
     df.loc[('d', 'w'), :]
   ], axis=0)

Slicing specification for “a” and “b” is slightly cleaner (('a', 'b'), ('u', 'v')) because the same sub-levels being indexed are the same for each level.


Question 7

How do I get all rows where values in level “two” are greater than 5?

         col
one two     
b   7      4
    9      5
c   7     10
d   6     11
    8     12
    8     13
    6     15

This can be done using query,

df2.query("two > 5")

And get_level_values.

df2[df2.index.get_level_values('two') > 5]

Note
Similar to this example, we can filter based on any arbitrary condition using these constructs. In general, it is useful to remember that loc and xs are specifically for label-based indexing, while query and get_level_values are helpful for building general conditional masks for filtering.


Bonus Question

What if I need to slice a MultiIndex column?

Actually, most solutions here are applicable to columns as well, with minor changes. Consider:

np.random.seed(0)
mux3 = pd.MultiIndex.from_product([
        list('ABCD'), list('efgh')
], names=['one','two'])

df3 = pd.DataFrame(np.random.choice(10, (3, len(mux))), columns=mux3)
print(df3)

one  A           B           C           D         
two  e  f  g  h  e  f  g  h  e  f  g  h  e  f  g  h
0    5  0  3  3  7  9  3  5  2  4  7  6  8  8  1  6
1    7  7  8  1  5  9  8  9  4  3  0  3  5  0  2  3
2    8  1  3  3  3  7  0  1  9  9  0  4  7  3  2  7

These are the following changes you will need to make to the Four Idioms to have them working with columns.

  1. To slice with loc, use

     df3.loc[:, ....] # Notice how we slice across the index with `:`. 
    

    or,

     df3.loc[:, pd.IndexSlice[...]]
    
  2. To use xs as appropriate, just pass an argument axis=1.

  3. You can access the column level values directly using df.columns.get_level_values. You will then need to do something like

     df.loc[:, {condition}] 
    

    Where {condition} represents some condition built using columns.get_level_values.

  4. To use query, your only option is to transpose, query on the index, and transpose again:

     df3.T.query(...).T
    

    Not recommended, use one of the other 3 options.


回答 1

最近,我遇到一个用例,其中有一个3级以上的多索引数据框,在其中我无法使上面的任何解决方案产生想要的结果。上面的解决方案很可能确实可以在我的用例中使用,并且我尝试了几种,但是我无法在我有空的时候让它们使用。

我距离专家还很远,但是我偶然发现了上面的综合答案中未列出的解决方案。我不保证解决方案无论如何都是最佳的。

这是获得与上面的问题6稍有不同的结果的不同方法。(以及其他可能的问题)

我特别在寻找:

  1. 一种从一个索引级别选择两个以上的值,从另一个索引级别选择一个值的方法,以及
  2. 在数据帧输出中保留上一操作的索引值的方法。

作为齿轮上的活动扳手(但完全可固定):

  1. 索引未命名。

在下面的玩具数据帧上:

    index = pd.MultiIndex.from_product([['a','b'],
                               ['stock1','stock2','stock3'],
                               ['price','volume','velocity']])

    df = pd.DataFrame([1,2,3,4,5,6,7,8,9,
                      10,11,12,13,14,15,16,17,18], 
                       index)

                        0
    a stock1 price      1
             volume     2
             velocity   3
      stock2 price      4
             volume     5
             velocity   6
      stock3 price      7
             volume     8
             velocity   9
    b stock1 price     10
             volume    11
             velocity  12
      stock2 price     13
             volume    14
             velocity  15
      stock3 price     16
             volume    17
             velocity  18

当然,使用以下作品:

    df.xs(('stock1', 'velocity'), level=(1,2))

        0
    a   3
    b  12

但是我想要一个不同的结果,所以获得该结果的方法是:

   df.iloc[df.index.isin(['stock1'], level=1) & 
           df.index.isin(['velocity'], level=2)] 

                        0
    a stock1 velocity   3
    b stock1 velocity  12

如果我想从一个级别获得两个以上的值,并从另一个级别获得一个(或2个以上)值:

    df.iloc[df.index.isin(['stock1','stock3'], level=1) & 
            df.index.isin(['velocity'], level=2)] 

                        0
    a stock1 velocity   3
      stock3 velocity   9
    b stock1 velocity  12
      stock3 velocity  18

上面的方法可能有点笨拙,但是我发现它满足了我的需求,而且奖金对我来说更易于理解和阅读。

Recently I came across a use case where I had a 3+ level multi-index dataframe in which I couldn’t make any of the solutions above produce the results I was looking for. It’s quite possible that the above solutions do of course work for my use case, and I tried several, however I was unable to get them to work with the time I had available.

I am far from expert, but I stumbled across a solution that was not listed in the comprehensive answers above. I offer no guarantee that the solutions are in any way optimal.

This is a different way to get a slightly different result to Question #6 above. (and likely other questions as well)

Specifically I was looking for:

  1. A way to choose two+ values from one level of the index and a single value from another level of the index, and
  2. A way to leave the index values from the previous operation in the dataframe output.

As a monkey wrench in the gears (however totally fixable):

  1. The indexes were unnamed.

On the toy dataframe below:

    index = pd.MultiIndex.from_product([['a','b'],
                               ['stock1','stock2','stock3'],
                               ['price','volume','velocity']])

    df = pd.DataFrame([1,2,3,4,5,6,7,8,9,
                      10,11,12,13,14,15,16,17,18], 
                       index)

                        0
    a stock1 price      1
             volume     2
             velocity   3
      stock2 price      4
             volume     5
             velocity   6
      stock3 price      7
             volume     8
             velocity   9
    b stock1 price     10
             volume    11
             velocity  12
      stock2 price     13
             volume    14
             velocity  15
      stock3 price     16
             volume    17
             velocity  18

Using the below works, of course:

    df.xs(('stock1', 'velocity'), level=(1,2))

        0
    a   3
    b  12

But I wanted a different result, so my method to get that result was:

   df.iloc[df.index.isin(['stock1'], level=1) & 
           df.index.isin(['velocity'], level=2)] 

                        0
    a stock1 velocity   3
    b stock1 velocity  12

And if I wanted two+ values from one level and a single (or 2+) value from another level:

    df.iloc[df.index.isin(['stock1','stock3'], level=1) & 
            df.index.isin(['velocity'], level=2)] 

                        0
    a stock1 velocity   3
      stock3 velocity   9
    b stock1 velocity  12
      stock3 velocity  18

The above method is probably a bit clunky, however I found it filled my needs and as a bonus was easier for me to understand and read.


声明:本站所有文章,如无特殊说明或标注,均为本站原创发布。任何个人或组织,在未征得本站同意时,禁止复制、盗用、采集、发布本站内容到任何网站、书籍等各类媒体平台。如若本站内容侵犯了原著者的合法权益,可联系我们进行处理。