Tag Archives: pandas

The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all()

Question: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all()

I'm having an issue filtering my result dataframe with an or condition. I want my result df to extract all column var values that are above 0.25 or below -0.25.

The logic below gives me an ambiguous truth value; however, it works when I split this filtering into two separate operations. What is happening here? I'm not sure where to use the suggested a.empty(), a.bool(), a.item(), a.any() or a.all().

 result = result[(result['var']>0.25) or (result['var']<-0.25)]

Answer 0

The or and and Python statements require truth values. Since pandas considers these ambiguous, you should use “bitwise” | (or) or & (and) operations:

result = result[(result['var']>0.25) | (result['var']<-0.25)]

These are overloaded for these kinds of data structures to yield the element-wise or (or and).


Just to add some more explanation to this statement:

The exception is thrown when you want to get the bool of a pandas.Series:

>>> import pandas as pd
>>> x = pd.Series([1])
>>> bool(x)
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

What you hit was a place where the operator implicitly converted the operands to bool (you used or but it also happens for and, if and while):

>>> x or x
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
>>> x and x
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
>>> if x:
...     print('fun')
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
>>> while x:
...     print('fun')
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

Besides these 4 statements, there are several Python functions that hide some bool calls (like any, all, filter, …). These are normally not problematic with pandas.Series, but for completeness I wanted to mention them.
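For illustration, one way such a hidden bool call can surface: any iterates its argument and calls bool on each element, so a Series wrapped in a list triggers the conversion:

>>> any([x])
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().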


In your case the exception isn’t really helpful, because it doesn’t mention the right alternatives. For and and or you can use (if you want element-wise comparisons):

  • numpy.logical_or:

    >>> import numpy as np
    >>> np.logical_or(x, y)
    

    or simply the | operator:

    >>> x | y
    
  • numpy.logical_and:

    >>> np.logical_and(x, y)
    

    or simply the & operator:

    >>> x & y
    

If you're using the operators, then make sure you set your parentheses correctly, because of operator precedence.
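A minimal illustration of the precedence pitfall (x assumed here to be a small float Series): | binds more tightly than the comparisons, so x > 0.25 | x < -0.25 is parsed as x > (0.25 | x) < -0.25 and fails, while the parenthesised form works:

>>> x = pd.Series([0.3, -0.5, 0.1])
>>> (x > 0.25) | (x < -0.25)
0     True
1     True
2    False
dtype: bool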

There are several logical numpy functions which should work on pandas.Series.


The alternatives mentioned in the exception are more suited if you encountered it when doing if or while. I'll briefly explain each of these:

  • If you want to check if your Series is empty:

    >>> x = pd.Series([])
    >>> x.empty
    True
    >>> x = pd.Series([1])
    >>> x.empty
    False
    

    Python normally interprets the length of containers (like list, tuple, …) as a truth value if the object has no explicit boolean interpretation. So if you want the python-like check, you could do if x.size or if not x.empty instead of if x.

  • If your Series contains one and only one boolean value:

    >>> x = pd.Series([100])
    >>> (x > 50).bool()
    True
    >>> (x < 50).bool()
    False
    
  • If you want to check the first and only item of your Series (like .bool(), but it also works for non-boolean content):

    >>> x = pd.Series([100])
    >>> x.item()
    100
    
  • If you want to check if all or any item is not-zero, not-empty or not-False:

    >>> x = pd.Series([0, 1, 2])
    >>> x.all()   # because one element is zero
    False
    >>> x.any()   # because one (or more) elements are non-zero
    True
    
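Tying these back to the original filtering problem, here is a short sketch of such a reduction used inside an if statement:

>>> s = pd.Series([0.1, 0.3, -0.5])
>>> mask = (s > 0.25) | (s < -0.25)
>>> if mask.any():
...     print(s[mask])
1    0.3
2   -0.5
dtype: float64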

Answer 1

For boolean logic, use & and |.

np.random.seed(0)
df = pd.DataFrame(np.random.randn(5,3), columns=list('ABC'))

>>> df
          A         B         C
0  1.764052  0.400157  0.978738
1  2.240893  1.867558 -0.977278
2  0.950088 -0.151357 -0.103219
3  0.410599  0.144044  1.454274
4  0.761038  0.121675  0.443863

>>> df.loc[(df.C > 0.25) | (df.C < -0.25)]
          A         B         C
0  1.764052  0.400157  0.978738
1  2.240893  1.867558 -0.977278
3  0.410599  0.144044  1.454274
4  0.761038  0.121675  0.443863

To see what is happening, you get a column of booleans for each comparison, e.g.

df.C > 0.25
0     True
1    False
2    False
3     True
4     True
Name: C, dtype: bool

When you have multiple criteria, you will get multiple columns returned. This is why the join logic is ambiguous. Using and or or treats each column separately, so you first need to reduce each column to a single boolean value. For example, to see if any value or all values in each of the columns is True.

# Any value in either column is True?
(df.C > 0.25).any() or (df.C < -0.25).any()
True

# All values in either column is True?
(df.C > 0.25).all() or (df.C < -0.25).all()
False

One convoluted way to achieve the same thing is to zip all of these columns together, and perform the appropriate logic.

>>> df[[any([a, b]) for a, b in zip(df.C > 0.25, df.C < -0.25)]]
          A         B         C
0  1.764052  0.400157  0.978738
1  2.240893  1.867558 -0.977278
3  0.410599  0.144044  1.454274
4  0.761038  0.121675  0.443863

For more details, refer to Boolean Indexing in the docs.


Answer 2

Well, pandas uses the bitwise operators '&' and '|', and each condition should be wrapped in '()'.

For example, the following works:

data_query = data[(data['year'] >= 2005) & (data['year'] <= 2010)]

But the same query without proper brackets does not:

data_query = data[(data['year'] >= 2005 & data['year'] <= 2010)]

Answer 3

Or, alternatively, you could use the operator module. More detailed information is in the Python docs.

import operator
import numpy as np
import pandas as pd
np.random.seed(0)
df = pd.DataFrame(np.random.randn(5,3), columns=list('ABC'))
df.loc[operator.or_(df.C > 0.25, df.C < -0.25)]

          A         B         C
0  1.764052  0.400157  0.978738
1  2.240893  1.867558 -0.977278
3  0.410599  0.144044  1.454274
4  0.761038  0.121675  0.443863

Answer 4

This excellent answer explains very well what is happening and provides a solution. I would like to add another solution that might be suitable in similar cases: using the query method:

result = result.query("(var > 0.25) or (var < -0.25)")

See also http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-query.

(Some tests with a dataframe I’m currently working with suggest that this method is a bit slower than using the bitwise operators on series of booleans: 2 ms vs. 870 µs)

A word of warning: at least one situation where this is not straightforward is when column names happen to be Python expressions. I had columns named WT_38hph_IP_2, WT_38hph_input_2 and log2(WT_38hph_IP_2/WT_38hph_input_2) and wanted to perform the following query: "(log2(WT_38hph_IP_2/WT_38hph_input_2) > 1) and (WT_38hph_IP_2 > 20)"

I obtained the following exception cascade:

  • KeyError: 'log2'
  • UndefinedVariableError: name 'log2' is not defined
  • ValueError: "log2" is not a supported function

I guess this happened because the query parser was trying to make something from the first two columns instead of identifying the expression with the name of the third column.

A possible workaround is proposed here.
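One always-available fallback (an editorial sketch, distinct from the linked workaround) is to skip query and build the boolean mask explicitly, so the problematic column name is handled as a plain string key:

mask = (
    (result["log2(WT_38hph_IP_2/WT_38hph_input_2)"] > 1)
    & (result["WT_38hph_IP_2"] > 20)
)
result = result[mask]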


Answer 5

I encountered the same error and was stalled with a pyspark dataframe for a few days. I was able to resolve it successfully by filling na values with 0, since I was comparing integer values from 2 fields.
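A minimal pandas sketch of the same idea (the column names here are hypothetical):

# fill missing values first so the comparison yields a clean boolean mask
df["field1"] = df["field1"].fillna(0)
df["field2"] = df["field2"].fillna(0)
result = df[df["field1"] > df["field2"]]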


Pandas Merging 101

Question: Pandas Merging 101

  • How to perform a (LEFT|RIGHT|FULL) (INNER|OUTER) join with pandas?
  • How do I add NaNs for missing rows after merge?
  • How do I get rid of NaNs after merging?
  • Can I merge on the index?
  • Cross join with pandas?
  • How do I merge multiple DataFrames?
  • merge? join? concat? update? Who? What? Why?!

… and more. I’ve seen these recurring questions asking about various facets of the pandas merge functionality. Most of the information regarding merge and its various use cases today is fragmented across dozens of badly worded, unsearchable posts. The aim here is to collate some of the more important points for posterity.

This QnA is meant to be the next installment in a series of helpful user guides on common pandas idioms (see this post on pivoting, and this post on concatenation, which I will be touching on later).

Please note that this post is not meant to be a replacement for the documentation, so please read that as well! Some of the examples are taken from there.


Answer 0

This post aims to give readers a primer on SQL-flavoured merging with pandas, how to use it, and when not to use it.

In particular, here’s what this post will go through:

  • The basics – types of joins (LEFT, RIGHT, OUTER, INNER)

    • merging with different column names
    • avoiding duplicate merge key column in output
  • Merging with index under different conditions
    • effectively using your named index
    • merge key as the index of one and column of another
  • Multiway merges on columns and indexes (unique and non-unique)
  • Notable alternatives to merge and join

What this post will not go through:

  • Performance-related discussions and timings (for now). Mostly notable mentions of better alternatives, wherever appropriate.
  • Handling suffixes, removing extra columns, renaming outputs, and other specific use cases. There are other (read: better) posts that deal with that, so figure it out!

Note
Most examples default to INNER JOIN operations while demonstrating various features, unless otherwise specified.

Furthermore, all the DataFrames here can be copied and replicated so you can play with them. Also, see this post on how to read DataFrames from your clipboard.

Lastly, all visual representation of JOIN operations have been hand-drawn using Google Drawings. Inspiration from here.

Enough Talk, just show me how to use merge!

Setup

np.random.seed(0)
left = pd.DataFrame({'key': ['A', 'B', 'C', 'D'], 'value': np.random.randn(4)})    
right = pd.DataFrame({'key': ['B', 'D', 'E', 'F'], 'value': np.random.randn(4)})

left

  key     value
0   A  1.764052
1   B  0.400157
2   C  0.978738
3   D  2.240893

right

  key     value
0   B  1.867558
1   D -0.977278
2   E  0.950088
3   F -0.151357

For the sake of simplicity, the key column has the same name (for now).

An INNER JOIN is represented by

Note
This, along with the forthcoming figures all follow this convention:

  • blue indicates rows that are present in the merge result
  • red indicates rows that are excluded from the result (i.e., removed)
  • green indicates missing values that are replaced with NaNs in the result

To perform an INNER JOIN, call merge on the left DataFrame, specifying the right DataFrame and the join key (at the very least) as arguments.

left.merge(right, on='key')
# Or, if you want to be explicit
# left.merge(right, on='key', how='inner')

  key   value_x   value_y
0   B  0.400157  1.867558
1   D  2.240893 -0.977278

This returns only rows from left and right which share a common key (in this example, “B” and “D”).

A LEFT OUTER JOIN, or LEFT JOIN is represented by

This can be performed by specifying how='left'.

left.merge(right, on='key', how='left')

  key   value_x   value_y
0   A  1.764052       NaN
1   B  0.400157  1.867558
2   C  0.978738       NaN
3   D  2.240893 -0.977278

Carefully note the placement of NaNs here. If you specify how='left', then only keys from left are used, and missing data from right is replaced by NaN.

And similarly, for a RIGHT OUTER JOIN, or RIGHT JOIN which is…

…specify how='right':

left.merge(right, on='key', how='right')

  key   value_x   value_y
0   B  0.400157  1.867558
1   D  2.240893 -0.977278
2   E       NaN  0.950088
3   F       NaN -0.151357

Here, keys from right are used, and missing data from left is replaced by NaN.

Finally, for the FULL OUTER JOIN, given by

specify how='outer'.

left.merge(right, on='key', how='outer')

  key   value_x   value_y
0   A  1.764052       NaN
1   B  0.400157  1.867558
2   C  0.978738       NaN
3   D  2.240893 -0.977278
4   E       NaN  0.950088
5   F       NaN -0.151357

This uses the keys from both frames, and NaNs are inserted for missing rows in both.

The documentation summarises these various merges nicely:

(figure: table from the pandas documentation summarising the merge types)

Other JOINs – LEFT-Excluding, RIGHT-Excluding, and FULL-Excluding/ANTI JOINs

If you need LEFT-Excluding JOINs and RIGHT-Excluding JOINs, you can do them in two steps.

For LEFT-Excluding JOIN, represented as

Start by performing a LEFT OUTER JOIN and then filtering to the rows coming from left only (excluding everything from the right),

(left.merge(right, on='key', how='left', indicator=True)
     .query('_merge == "left_only"')
     .drop('_merge', 1))

  key   value_x  value_y
0   A  1.764052      NaN
2   C  0.978738      NaN

Where,

left.merge(right, on='key', how='left', indicator=True)

  key   value_x   value_y     _merge
0   A  1.764052       NaN  left_only
1   B  0.400157  1.867558       both
2   C  0.978738       NaN  left_only
3   D  2.240893 -0.977278       both

And similarly, for a RIGHT-Excluding JOIN,

(left.merge(right, on='key', how='right', indicator=True)
     .query('_merge == "right_only"')
     .drop('_merge', 1))

  key  value_x   value_y
2   E      NaN  0.950088
3   F      NaN -0.151357

Lastly, if you are required to do a merge that only retains keys from the left or right, but not both (IOW, performing an ANTI-JOIN),

You can do this in similar fashion—

(left.merge(right, on='key', how='outer', indicator=True)
     .query('_merge != "both"')
     .drop('_merge', 1))

  key   value_x   value_y
0   A  1.764052       NaN
2   C  0.978738       NaN
4   E       NaN  0.950088
5   F       NaN -0.151357
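As a side note (an editorial addition, not part of the original answer): when joining on a single key column, a LEFT-Excluding JOIN can also be written without merge by using isin:

left[~left['key'].isin(right['key'])]

  key     value
0   A  1.764052
2   C  0.978738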

Different names for key columns

If the key columns are named differently—for example, left has keyLeft, and right has keyRight instead of key—then you will have to specify left_on and right_on as arguments instead of on:

left2 = left.rename({'key':'keyLeft'}, axis=1)
right2 = right.rename({'key':'keyRight'}, axis=1)

left2

  keyLeft     value
0       A  1.764052
1       B  0.400157
2       C  0.978738
3       D  2.240893

right2

  keyRight     value
0        B  1.867558
1        D -0.977278
2        E  0.950088
3        F -0.151357

left2.merge(right2, left_on='keyLeft', right_on='keyRight', how='inner')

  keyLeft   value_x keyRight   value_y
0       B  0.400157        B  1.867558
1       D  2.240893        D -0.977278

Avoiding duplicate key column in output

When merging on keyLeft from left and keyRight from right, if you only want either of the keyLeft or keyRight (but not both) in the output, you can start by setting the index as a preliminary step.

left3 = left2.set_index('keyLeft')
left3.merge(right2, left_index=True, right_on='keyRight')

    value_x keyRight   value_y
0  0.400157        B  1.867558
1  2.240893        D -0.977278

Contrast this with the output of the command just before (that is, the output of left2.merge(right2, left_on='keyLeft', right_on='keyRight', how='inner')); you'll notice keyLeft is missing. You can figure out what column to keep based on which frame's index is set as the key. This may matter when, say, performing some OUTER JOIN operation.
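Alternatively (a sketch, not in the original answer), merge as usual and then drop the redundant key column:

left2.merge(right2, left_on='keyLeft', right_on='keyRight', how='inner').drop(columns='keyRight')

  keyLeft   value_x   value_y
0       B  0.400157  1.867558
1       D  2.240893 -0.977278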

Merging only a single column from one of the DataFrames

For example, consider

right3 = right.assign(newcol=np.arange(len(right)))
right3
  key     value  newcol
0   B  1.867558       0
1   D -0.977278       1
2   E  0.950088       2
3   F -0.151357       3

If you are required to merge only “newcol” (without any of the other columns), you can usually just subset columns before merging:

left.merge(right3[['key', 'newcol']], on='key')

  key     value  newcol
0   B  0.400157       0
1   D  2.240893       1

If you’re doing a LEFT OUTER JOIN, a more performant solution would involve map:

# left['newcol'] = left['key'].map(right3.set_index('key')['newcol']))
left.assign(newcol=left['key'].map(right3.set_index('key')['newcol']))

  key     value  newcol
0   A  1.764052     NaN
1   B  0.400157     0.0
2   C  0.978738     NaN
3   D  2.240893     1.0

As mentioned, this is similar to, but faster than

left.merge(right3[['key', 'newcol']], on='key', how='left')

  key     value  newcol
0   A  1.764052     NaN
1   B  0.400157     0.0
2   C  0.978738     NaN
3   D  2.240893     1.0

Merging on multiple columns

To join on more than one column, specify a list for on (or left_on and right_on, as appropriate).

left.merge(right, on=['key1', 'key2'] ...)

Or, in the event the names are different,

left.merge(right, left_on=['lkey1', 'lkey2'], right_on=['rkey1', 'rkey2'])
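A concrete toy example (both frames are invented here purely for illustration):

l = pd.DataFrame({'key1': ['a', 'a'], 'key2': [1, 2], 'x': [10, 20]})
r = pd.DataFrame({'key1': ['a', 'a'], 'key2': [2, 3], 'y': [30, 40]})
l.merge(r, on=['key1', 'key2'])

  key1  key2   x   y
0    a     2  20  30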

Other useful merge* operations and functions

This section only covers the very basics, and is designed to only whet your appetite. For more examples and cases, see the documentation on merge, join, and concat as well as the links to the function specs.


Index-based *-JOIN (+ index-column merges)

Setup

np.random.seed([3, 14])
left = pd.DataFrame({'value': np.random.randn(4)}, index=['A', 'B', 'C', 'D'])    
right = pd.DataFrame({'value': np.random.randn(4)}, index=['B', 'D', 'E', 'F'])
left.index.name = right.index.name = 'idxkey'

left
           value
idxkey          
A      -0.602923
B      -0.402655
C       0.302329
D      -0.524349

right

           value
idxkey          
B       0.543843
D       0.013135
E      -0.326498
F       1.385076

Typically, a merge on index would look like this:

left.merge(right, left_index=True, right_index=True)


         value_x   value_y
idxkey                    
B      -0.402655  0.543843
D      -0.524349  0.013135

Support for index names

If your index is named, then from v0.23 you can also specify the level name to on (or left_on and right_on as necessary).

left.merge(right, on='idxkey')

         value_x   value_y
idxkey                    
B      -0.402655  0.543843
D      -0.524349  0.013135

Merging on index of one, column(s) of another

It is possible (and quite simple) to use the index of one, and the column of another, to perform a merge. For example,

left.merge(right, left_on='key1', right_index=True)

Or vice versa (right_on=... and left_index=True).

right2 = right.reset_index().rename({'idxkey' : 'colkey'}, axis=1)
right2

  colkey     value
0      B  0.543843
1      D  0.013135
2      E -0.326498
3      F  1.385076

left.merge(right2, left_index=True, right_on='colkey')

    value_x colkey   value_y
0 -0.402655      B  0.543843
1 -0.524349      D  0.013135

In this special case, the index for left is named, so you can also use the index name with left_on, like this:

left.merge(right2, left_on='idxkey', right_on='colkey')

    value_x colkey   value_y
0 -0.402655      B  0.543843
1 -0.524349      D  0.013135

DataFrame.join
Besides these, there is another succinct option. You can use DataFrame.join which defaults to joins on the index. DataFrame.join does a LEFT OUTER JOIN by default, so how='inner' is necessary here.

left.join(right, how='inner', lsuffix='_x', rsuffix='_y')

         value_x   value_y
idxkey                    
B      -0.402655  0.543843
D      -0.524349  0.013135

Note that I needed to specify the lsuffix and rsuffix arguments since join would otherwise error out:

left.join(right)
ValueError: columns overlap but no suffix specified: Index(['value'], dtype='object')

This is because the column names are the same. It would not be a problem if they were named differently:

left.rename(columns={'value':'leftvalue'}).join(right, how='inner')

        leftvalue     value
idxkey                     
B       -0.402655  0.543843
D       -0.524349  0.013135

pd.concat
Lastly, as an alternative for index-based joins, you can use pd.concat:

pd.concat([left, right], axis=1, sort=False, join='inner')

           value     value
idxkey                    
B      -0.402655  0.543843
D      -0.524349  0.013135

Omit join='inner' if you need a FULL OUTER JOIN (the default):

pd.concat([left, right], axis=1, sort=False)

      value     value
A -0.602923       NaN
B -0.402655  0.543843
C  0.302329       NaN
D -0.524349  0.013135
E       NaN -0.326498
F       NaN  1.385076

For more information, see this canonical post on pd.concat by @piRSquared.


Generalizing: merging multiple DataFrames

Oftentimes, the situation arises when multiple DataFrames are to be merged together. Naively, this can be done by chaining merge calls:

df1.merge(df2, ...).merge(df3, ...)

However, this quickly gets out of hand for many DataFrames. Furthermore, it may be necessary to generalise for an unknown number of DataFrames.
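One way to generalise the chaining is functools.reduce (an editorial sketch, assuming every frame shares a single common key column named 'key'):

from functools import reduce

# dfs is a list of DataFrames, e.g. the [A, B, C] defined in the setup below
merged = reduce(lambda l, r: l.merge(r, on='key', how='inner'), dfs)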

Here I introduce pd.concat for multi-way joins on unique keys, and DataFrame.join for multi-way joins on non-unique keys. First, the setup.

# Setup.
np.random.seed(0)
A = pd.DataFrame({'key': ['A', 'B', 'C', 'D'], 'valueA': np.random.randn(4)})    
B = pd.DataFrame({'key': ['B', 'D', 'E', 'F'], 'valueB': np.random.randn(4)})
C = pd.DataFrame({'key': ['D', 'E', 'J', 'C'], 'valueC': np.ones(4)})
dfs = [A, B, C] 

# Note, the "key" column values are unique, so the index is unique.
A2 = A.set_index('key')
B2 = B.set_index('key')
C2 = C.set_index('key')

dfs2 = [A2, B2, C2]

Multiway merge on unique keys (or index)

If your keys (here, the key could either be a column or an index) are unique, then you can use pd.concat. Note that pd.concat joins DataFrames on the index.

# merge on `key` column, you'll need to set the index before concatenating
pd.concat([
    df.set_index('key') for df in dfs], axis=1, join='inner'
).reset_index()

  key    valueA    valueB  valueC
0   D  2.240893 -0.977278     1.0

# merge on `key` index
pd.concat(dfs2, axis=1, sort=False, join='inner')

       valueA    valueB  valueC
key                            
D    2.240893 -0.977278     1.0

Omit join='inner' for a FULL OUTER JOIN. Note that you cannot specify LEFT or RIGHT OUTER joins (if you need these, use join, described below).

Multiway merge on keys with duplicates

concat is fast, but has its shortcomings. It cannot handle duplicates.

A3 = pd.DataFrame({'key': ['A', 'B', 'C', 'D', 'D'], 'valueA': np.random.randn(5)})

pd.concat([df.set_index('key') for df in [A3, B, C]], axis=1, join='inner')
ValueError: Shape of passed values is (3, 4), indices imply (3, 2)

In this situation, we can use join since it can handle non-unique keys (note that join joins DataFrames on their index; it calls merge under the hood and does a LEFT OUTER JOIN unless otherwise specified).

# join on `key` column, set as the index first
# For inner join. For left join, omit the "how" argument.
A.set_index('key').join(
    [df.set_index('key') for df in (B, C)], how='inner').reset_index()

  key    valueA    valueB  valueC
0   D  2.240893 -0.977278     1.0

# join on `key` index
A3.set_index('key').join([B2, C2], how='inner')

       valueA    valueB  valueC
key                            
D    1.454274 -0.977278     1.0
D    0.761038 -0.977278     1.0

Answer 1

A supplemental visual view of pd.concat([df0, df1], kwargs). Notice that the meaning of the kwarg axis=0 or axis=1 is not as intuitive as in df.mean() or df.apply(func).

(figures: hand-drawn diagrams of pd.concat([df0, df1]) with axis=0 and axis=1)
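Since the original drawings are not reproduced here, a minimal runnable sketch of the two axes (frames invented for illustration):

import pandas as pd

df0 = pd.DataFrame({'a': [1, 2]})
df1 = pd.DataFrame({'a': [3, 4]})

pd.concat([df0, df1], axis=0)   # stacks vertically: 4 rows, 1 column
pd.concat([df0, df1], axis=1)   # places side by side: 2 rows, 2 columns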


How to pivot a dataframe

Question: How to pivot a dataframe

  • What is pivot?
  • How do I pivot?
  • Is this a pivot?
  • Long format to wide format?

I’ve seen a lot of questions that ask about pivot tables. Even if they don’t know that they are asking about pivot tables, they usually are. It is virtually impossible to write a canonical question and answer that encompasses all aspects of pivoting….

… But I’m going to give it a go.


The problem with existing questions and answers is that often the question is focused on a nuance that the OP has trouble generalizing from, in order to use one of the many existing good answers. None of the answers, however, attempts to give a comprehensive explanation (because it's a daunting task).

Look at a few examples from my Google search:

  1. How to pivot a dataframe in Pandas?
    • Good question and answer. But the answer only answers the specific question with little explanation.
  2. pandas pivot table to data frame
    • In this question, the OP is concerned with the output of the pivot. Namely how the columns look. OP wanted it to look like R. This isn’t very helpful for pandas users.
  3. pandas pivoting a dataframe, duplicate rows
    • Another decent question but the answer focuses on one method, namely pd.DataFrame.pivot

So whenever someone searches for pivot they get sporadic results that are likely not going to answer their specific question.


Setup

You may notice that I conspicuously named my columns and relevant column values to correspond with how I’m going to pivot in the answers below.

import numpy as np
import pandas as pd
from numpy.core.defchararray import add

np.random.seed([3,1415])
n = 20

cols = np.array(['key', 'row', 'item', 'col'])
arr1 = (np.random.randint(5, size=(n, 4)) // [2, 1, 2, 1]).astype(str)

df = pd.DataFrame(
    add(cols, arr1), columns=cols
).join(
    pd.DataFrame(np.random.rand(n, 2).round(2)).add_prefix('val')
)
print(df)

     key   row   item   col  val0  val1
0   key0  row3  item1  col3  0.81  0.04
1   key1  row2  item1  col2  0.44  0.07
2   key1  row0  item1  col0  0.77  0.01
3   key0  row4  item0  col2  0.15  0.59
4   key1  row0  item2  col1  0.81  0.64
5   key1  row2  item2  col4  0.13  0.88
6   key2  row4  item1  col3  0.88  0.39
7   key1  row4  item1  col1  0.10  0.07
8   key1  row0  item2  col4  0.65  0.02
9   key1  row2  item0  col2  0.35  0.61
10  key2  row0  item2  col1  0.40  0.85
11  key2  row4  item1  col2  0.64  0.25
12  key0  row2  item2  col3  0.50  0.44
13  key0  row4  item1  col4  0.24  0.46
14  key1  row3  item2  col3  0.28  0.11
15  key0  row3  item1  col1  0.31  0.23
16  key0  row0  item2  col3  0.86  0.01
17  key0  row4  item0  col3  0.64  0.21
18  key2  row2  item2  col0  0.13  0.45
19  key0  row2  item0  col4  0.37  0.70

Question(s)

  1. Why do I get ValueError: Index contains duplicate entries, cannot reshape

  2. How do I pivot df such that the col values are columns, row values are the index, and mean of val0 are the values?

    col   col0   col1   col2   col3  col4
    row                                  
    row0  0.77  0.605    NaN  0.860  0.65
    row2  0.13    NaN  0.395  0.500  0.25
    row3   NaN  0.310    NaN  0.545   NaN
    row4   NaN  0.100  0.395  0.760  0.24
    
  3. How do I pivot df such that the col values are columns, row values are the index, mean of val0 are the values, and missing values are 0?

    col   col0   col1   col2   col3  col4
    row                                  
    row0  0.77  0.605  0.000  0.860  0.65
    row2  0.13  0.000  0.395  0.500  0.25
    row3  0.00  0.310  0.000  0.545  0.00
    row4  0.00  0.100  0.395  0.760  0.24
    
  4. Can I get something other than mean, like maybe sum?

    col   col0  col1  col2  col3  col4
    row                               
    row0  0.77  1.21  0.00  0.86  0.65
    row2  0.13  0.00  0.79  0.50  0.50
    row3  0.00  0.31  0.00  1.09  0.00
    row4  0.00  0.10  0.79  1.52  0.24
    
  5. Can I do more than one aggregation at a time?

           sum                          mean                           
    col   col0  col1  col2  col3  col4  col0   col1   col2   col3  col4
    row                                                                
    row0  0.77  1.21  0.00  0.86  0.65  0.77  0.605  0.000  0.860  0.65
    row2  0.13  0.00  0.79  0.50  0.50  0.13  0.000  0.395  0.500  0.25
    row3  0.00  0.31  0.00  1.09  0.00  0.00  0.310  0.000  0.545  0.00
    row4  0.00  0.10  0.79  1.52  0.24  0.00  0.100  0.395  0.760  0.24
    
  6. Can I aggregate over multiple value columns?

          val0                             val1                          
    col   col0   col1   col2   col3  col4  col0   col1  col2   col3  col4
    row                                                                  
    row0  0.77  0.605  0.000  0.860  0.65  0.01  0.745  0.00  0.010  0.02
    row2  0.13  0.000  0.395  0.500  0.25  0.45  0.000  0.34  0.440  0.79
    row3  0.00  0.310  0.000  0.545  0.00  0.00  0.230  0.00  0.075  0.00
    row4  0.00  0.100  0.395  0.760  0.24  0.00  0.070  0.42  0.300  0.46
    
  7. Can I subdivide by multiple columns?

    item item0             item1                         item2                   
    col   col2  col3  col4  col0  col1  col2  col3  col4  col0   col1  col3  col4
    row                                                                          
    row0  0.00  0.00  0.00  0.77  0.00  0.00  0.00  0.00  0.00  0.605  0.86  0.65
    row2  0.35  0.00  0.37  0.00  0.00  0.44  0.00  0.00  0.13  0.000  0.50  0.13
    row3  0.00  0.00  0.00  0.00  0.31  0.00  0.81  0.00  0.00  0.000  0.28  0.00
    row4  0.15  0.64  0.00  0.00  0.10  0.64  0.88  0.24  0.00  0.000  0.00  0.00
    
  8. Or

    item      item0             item1                         item2                  
    col        col2  col3  col4  col0  col1  col2  col3  col4  col0  col1  col3  col4
    key  row                                                                         
    key0 row0  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.86  0.00
         row2  0.00  0.00  0.37  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.50  0.00
         row3  0.00  0.00  0.00  0.00  0.31  0.00  0.81  0.00  0.00  0.00  0.00  0.00
         row4  0.15  0.64  0.00  0.00  0.00  0.00  0.00  0.24  0.00  0.00  0.00  0.00
    key1 row0  0.00  0.00  0.00  0.77  0.00  0.00  0.00  0.00  0.00  0.81  0.00  0.65
         row2  0.35  0.00  0.00  0.00  0.00  0.44  0.00  0.00  0.00  0.00  0.00  0.13
         row3  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.28  0.00
         row4  0.00  0.00  0.00  0.00  0.10  0.00  0.00  0.00  0.00  0.00  0.00  0.00
    key2 row0  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.40  0.00  0.00
         row2  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.13  0.00  0.00  0.00
         row4  0.00  0.00  0.00  0.00  0.00  0.64  0.88  0.00  0.00  0.00  0.00  0.00
    
  9. Can I aggregate the frequency in which the column and rows occur together, aka “cross tabulation”?

    col   col0  col1  col2  col3  col4
    row                               
    row0     1     2     0     1     1
    row2     1     0     2     1     2
    row3     0     1     0     2     0
    row4     0     1     2     2     1
    
  10. How do I convert a DataFrame from long to wide by pivoting on ONLY two columns? Given,

    np.random.seed([3, 1415])
    df2 = pd.DataFrame({'A': list('aaaabbbc'), 'B': np.random.choice(15, 8)})        
    df2        
       A   B
    0  a   0
    1  a  11
    2  a   2
    3  a  11
    4  b  10
    5  b  10
    6  b  14
    7  c   7
    

    The expected output would look something like

          a     b    c
    0   0.0  10.0  7.0
    1  11.0  10.0  NaN
    2   2.0  14.0  NaN
    3  11.0   NaN  NaN
    
  11. How do I flatten the multi-index to a single index after pivot?

    From

       1  2   
       1  1  2        
    a  2  1  1
    b  2  1  0
    c  1  0  0
    

    To

       1|1  2|1  2|2               
    a    2    1    1
    b    2    1    0
    c    1    0    0
    

Answer 0

We start by answering the first question:

Question 1

Why do I get ValueError: Index contains duplicate entries, cannot reshape?

This happens because pandas is attempting to reindex either a columns or an index object with duplicate entries. There are various methods that can perform a pivot, and some of them are not well suited when the keys they are asked to pivot on contain duplicates. For example, consider pd.DataFrame.pivot. I know there are duplicate entries that share the row and col values:

df.duplicated(['row', 'col']).any()

True

So when I pivot using

df.pivot(index='row', columns='col', values='val0')

I get the error mentioned above. As a matter of fact, I get the same error when I try to perform the same task with:

df.set_index(['row', 'col'])['val0'].unstack()

Here is a list of idioms we can use to pivot:

  1. pd.DataFrame.groupby + pd.DataFrame.unstack
    • Good general approach for doing just about any type of pivot
    • You specify all columns that will constitute the pivoted row levels and column levels in one group by. You follow that by selecting the remaining columns you want to aggregate and the function(s) you want to perform the aggregation with. Finally, you unstack the levels that you want to be in the column index.
  2. pd.DataFrame.pivot_table
    • A glorified version of groupby with a more intuitive API. For many people, this is the preferred approach, and it is the approach intended by the developers.
    • Specify row levels, column levels, values to be aggregated, and the function(s) to perform the aggregations with.
  3. pd.DataFrame.set_index + pd.DataFrame.unstack
    • Convenient and intuitive for some (myself included). Cannot handle duplicate grouped keys.
    • Similar to the groupby paradigm, we specify all columns that will eventually be row or column levels and set those to be the index. We then unstack the levels we want in the columns. If either the remaining index levels or the column levels are not unique, this method will fail.
  4. pd.DataFrame.pivot
    • Very similar to set_index in that it shares the duplicate-key limitation. The API is very limited as well: it takes only scalar values for index, columns and values.
    • Similar to the pivot_table method in that we select rows, columns and values on which to pivot. However, we cannot aggregate, and if either rows or columns are not unique, this method will fail.
  5. pd.crosstab
    • This is a specialized version of pivot_table and, in its purest form, the most intuitive way to perform several of these tasks.
  6. pd.factorize + np.bincount
    • This is a highly advanced technique that is very obscure but very fast. It cannot be used in all circumstances, but when it can be used and you are comfortable with it, you will reap the performance rewards.
  7. pd.get_dummies + pd.DataFrame.dot
    • I use this for cleverly performing cross tabulation (see the sketch after this list).
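A minimal sketch of idiom 7 (an editorial addition, using the df from the setup above): multiplying the indicator matrices for row and col counts their co-occurrences, which reproduces the cross tabulation asked about in question 9:

pd.get_dummies(df['row']).T.dot(pd.get_dummies(df['col']))

      col0  col1  col2  col3  col4
row0     1     2     0     1     1
row2     1     0     2     1     2
row3     0     1     0     2     0
row4     0     1     2     2     1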

Examples

For each subsequent question and answer, I will answer it using pd.DataFrame.pivot_table. Then I'll provide alternatives to perform the same task.

Question 3

How do I pivot df such that the col values are columns, row values are the index, mean of val0 are the values, and missing values are 0?

  • pd.DataFrame.pivot_table

    • fill_value is not set by default. I tend to set it appropriately. In this case I set it to 0. Notice that I skipped question 2, since it's the same as this answer without the fill_value.
    • aggfunc='mean' is the default and I didn't have to set it. I included it to be explicit.

      df.pivot_table(
          values='val0', index='row', columns='col',
          fill_value=0, aggfunc='mean')
      
      col   col0   col1   col2   col3  col4
      row                                  
      row0  0.77  0.605  0.000  0.860  0.65
      row2  0.13  0.000  0.395  0.500  0.25
      row3  0.00  0.310  0.000  0.545  0.00
      row4  0.00  0.100  0.395  0.760  0.24
  • pd.DataFrame.groupby

    df.groupby(['row', 'col'])['val0'].mean().unstack(fill_value=0)
  • pd.crosstab

    pd.crosstab(
        index=df['row'], columns=df['col'],
        values=df['val0'], aggfunc='mean').fillna(0)

Question 4

Can I get something other than mean, like maybe sum?

  • pd.DataFrame.pivot_table

    df.pivot_table(
        values='val0', index='row', columns='col',
        fill_value=0, aggfunc='sum')
    
    col   col0  col1  col2  col3  col4
    row                               
    row0  0.77  1.21  0.00  0.86  0.65
    row2  0.13  0.00  0.79  0.50  0.50
    row3  0.00  0.31  0.00  1.09  0.00
    row4  0.00  0.10  0.79  1.52  0.24
  • pd.DataFrame.groupby

    df.groupby(['row', 'col'])['val0'].sum().unstack(fill_value=0)
  • pd.crosstab

    pd.crosstab(
        index=df['row'], columns=df['col'],
        values=df['val0'], aggfunc='sum').fillna(0)

Question 5

Can I do more than one aggregation at a time?

Notice that for pivot_table and crosstab I needed to pass a list of callables. On the other hand, groupby.agg is able to take strings for a limited number of special functions. groupby.agg would also have taken the same callables we passed to the others, but it is often more efficient to leverage the string function names, as there are efficiencies to be gained.

  • pd.DataFrame.pivot_table

    df.pivot_table(
        values='val0', index='row', columns='col',
        fill_value=0, aggfunc=[np.size, np.mean])
    
         size                      mean                           
    col  col0 col1 col2 col3 col4  col0   col1   col2   col3  col4
    row                                                           
    row0    1    2    0    1    1  0.77  0.605  0.000  0.860  0.65
    row2    1    0    2    1    2  0.13  0.000  0.395  0.500  0.25
    row3    0    1    0    2    0  0.00  0.310  0.000  0.545  0.00
    row4    0    1    2    2    1  0.00  0.100  0.395  0.760  0.24
  • pd.DataFrame.groupby

    df.groupby(['row', 'col'])['val0'].agg(['size', 'mean']).unstack(fill_value=0)
  • pd.crosstab

    pd.crosstab(
        index=df['row'], columns=df['col'],
        values=df['val0'], aggfunc=[np.size, np.mean]).fillna(0, downcast='infer')

问题6

我可以汇总多个值列吗?

  • pd.DataFrame.pivot_table:我们传入了 values=['val0', 'val1'],但其实完全可以省略

    df.pivot_table(
        values=['val0', 'val1'], index='row', columns='col',
        fill_value=0, aggfunc='mean')
    
          val0                             val1                          
    col   col0   col1   col2   col3  col4  col0   col1  col2   col3  col4
    row                                                                  
    row0  0.77  0.605  0.000  0.860  0.65  0.01  0.745  0.00  0.010  0.02
    row2  0.13  0.000  0.395  0.500  0.25  0.45  0.000  0.34  0.440  0.79
    row3  0.00  0.310  0.000  0.545  0.00  0.00  0.230  0.00  0.075  0.00
    row4  0.00  0.100  0.395  0.760  0.24  0.00  0.070  0.42  0.300  0.46
  • pd.DataFrame.groupby

    df.groupby(['row', 'col'])['val0', 'val1'].mean().unstack(fill_value=0)

问题7

可以按多列进行细分吗?

  • pd.DataFrame.pivot_table

    df.pivot_table(
        values='val0', index='row', columns=['item', 'col'],
        fill_value=0, aggfunc='mean')
    
    item item0             item1                         item2                   
    col   col2  col3  col4  col0  col1  col2  col3  col4  col0   col1  col3  col4
    row                                                                          
    row0  0.00  0.00  0.00  0.77  0.00  0.00  0.00  0.00  0.00  0.605  0.86  0.65
    row2  0.35  0.00  0.37  0.00  0.00  0.44  0.00  0.00  0.13  0.000  0.50  0.13
    row3  0.00  0.00  0.00  0.00  0.31  0.00  0.81  0.00  0.00  0.000  0.28  0.00
    row4  0.15  0.64  0.00  0.00  0.10  0.64  0.88  0.24  0.00  0.000  0.00  0.00
  • pd.DataFrame.groupby

    df.groupby(
        ['row', 'item', 'col']
    )['val0'].mean().unstack(['item', 'col']).fillna(0).sort_index(1)

问题8

可以按多列进行细分吗?

  • pd.DataFrame.pivot_table

    df.pivot_table(
        values='val0', index=['key', 'row'], columns=['item', 'col'],
        fill_value=0, aggfunc='mean')
    
    item      item0             item1                         item2                  
    col        col2  col3  col4  col0  col1  col2  col3  col4  col0  col1  col3  col4
    key  row                                                                         
    key0 row0  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.86  0.00
         row2  0.00  0.00  0.37  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.50  0.00
         row3  0.00  0.00  0.00  0.00  0.31  0.00  0.81  0.00  0.00  0.00  0.00  0.00
         row4  0.15  0.64  0.00  0.00  0.00  0.00  0.00  0.24  0.00  0.00  0.00  0.00
    key1 row0  0.00  0.00  0.00  0.77  0.00  0.00  0.00  0.00  0.00  0.81  0.00  0.65
         row2  0.35  0.00  0.00  0.00  0.00  0.44  0.00  0.00  0.00  0.00  0.00  0.13
         row3  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.28  0.00
         row4  0.00  0.00  0.00  0.00  0.10  0.00  0.00  0.00  0.00  0.00  0.00  0.00
    key2 row0  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.40  0.00  0.00
         row2  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.13  0.00  0.00  0.00
         row4  0.00  0.00  0.00  0.00  0.00  0.64  0.88  0.00  0.00  0.00  0.00  0.00
  • pd.DataFrame.groupby

    df.groupby(
        ['key', 'row', 'item', 'col']
    )['val0'].mean().unstack(['item', 'col']).fillna(0).sort_index(1)
  • pd.DataFrame.set_index,因为键的组合对行和列都是唯一的

    df.set_index(
        ['key', 'row', 'item', 'col']
    )['val0'].unstack(['item', 'col']).fillna(0).sort_index(1)

问题9

我可以汇总列和行一起出现的频率,又称为“交叉表”吗?

  • pd.DataFrame.pivot_table

    df.pivot_table(index='row', columns='col', fill_value=0, aggfunc='size')
    
    col   col0  col1  col2  col3  col4
    row                               
    row0     1     2     0     1     1
    row2     1     0     2     1     2
    row3     0     1     0     2     0
    row4     0     1     2     2     1
  • pd.DataFrame.groupby

    df.groupby(['row', 'col'])['val0'].size().unstack(fill_value=0)
  • pd.crosstab

    pd.crosstab(df['row'], df['col'])
  • pd.factorize + np.bincount

    # get integer factorization `i` and unique values `r`
    # for column `'row'`
    i, r = pd.factorize(df['row'].values)
    # get integer factorization `j` and unique values `c`
    # for column `'col'`
    j, c = pd.factorize(df['col'].values)
    # `n` will be the number of rows
    # `m` will be the number of columns
    n, m = r.size, c.size
    # `i * m + j` is a clever way of counting the 
    # factorization bins assuming a flat array of length
    # `n * m`.  Which is why we subsequently reshape as `(n, m)`
    b = np.bincount(i * m + j, minlength=n * m).reshape(n, m)
    # BTW, whenever I read this, I think 'Bean, Rice, and Cheese'
    pd.DataFrame(b, r, c)
    
          col3  col2  col0  col1  col4
    row3     2     0     0     1     0
    row2     1     2     1     0     2
    row0     1     0     1     2     1
    row4     2     2     0     1     1
  • pd.get_dummies

    pd.get_dummies(df['row']).T.dot(pd.get_dummies(df['col']))
    
          col0  col1  col2  col3  col4
    row0     1     2     0     1     1
    row2     1     0     2     1     2
    row3     0     1     0     2     0
    row4     0     1     2     2     1

问题10

如何通过仅旋转两列将DataFrame从长转换为宽?

第一步是为每行分配一个编号——该编号将成为该值在透视结果中的行索引。这一步用 GroupBy.cumcount 完成:

df2.insert(0, 'count', df.groupby('A').cumcount())
df2

   count  A   B
0      0  a   0
1      1  a  11
2      2  a   2
3      3  a  11
4      0  b  10
5      1  b  10
6      2  b  14
7      0  c   7

第二步是把新创建的列用作索引来调用 DataFrame.pivot:

df2.pivot(*df2)
# df2.pivot(index='count', columns='A', values='B')

A         a     b    c
count                 
0       0.0  10.0  7.0
1      11.0  10.0  NaN
2       2.0  14.0  NaN
3      11.0   NaN  NaN

问题11

pivot 之后,如何把多重索引展平为单层索引?

如果 columns 是由字符串组成的 object 类型,用 join:

df.columns = df.columns.map('|'.join)

否则用 format:

df.columns = df.columns.map('{0[0]}|{0[1]}'.format) 

We start by answering the first question:

Question 1

Why do I get ValueError: Index contains duplicate entries, cannot reshape

This occurs because pandas is attempting to reindex either a columns or index object with duplicate entries. There are various methods that can perform a pivot. Some of them are not well suited to cases where the keys being pivoted on contain duplicates. For example, consider pd.DataFrame.pivot. I know there are duplicate entries that share the row and col values:

df.duplicated(['row', 'col']).any()

True

So when I pivot using

df.pivot(index='row', columns='col', values='val0')

I get the error mentioned above. In fact, I get the same error when I try to perform the same task with:

df.set_index(['row', 'col'])['val0'].unstack()

Here is a list of idioms we can use to pivot

  1. pd.DataFrame.groupby + pd.DataFrame.unstack
    • Good general approach for doing just about any type of pivot
    • You specify all columns that will constitute the pivoted row levels and column levels in one group by. You follow that by selecting the remaining columns you want to aggregate and the function(s) you want to perform the aggregation with. Finally, you unstack the levels that you want to be in the column index.
  2. pd.DataFrame.pivot_table
    • A glorified version of groupby with more intuitive API. For many people, this is the preferred approach. And is the intended approach by the developers.
    • Specify row level, column levels, values to be aggregated, and function(s) to perform aggregations.
  3. pd.DataFrame.set_index + pd.DataFrame.unstack
    • Convenient and intuitive for some (myself included). Cannot handle duplicate grouped keys.
    • Similar to the groupby paradigm, we specify all columns that will eventually be either row or column levels and set those to be the index. We then unstack the levels we want in the columns. If either the remaining index levels or column levels are not unique, this method will fail.
  4. pd.DataFrame.pivot
    • Very similar to set_index in that it shares the duplicate key limitation. The API is very limited as well. It only takes scalar values for index, columns, values.
    • Similar to the pivot_table method in that we select rows, columns, and values on which to pivot. However, we cannot aggregate and if either rows or columns are not unique, this method will fail.
  5. pd.crosstab
    • This is a specialized version of pivot_table and in its purest form is the most intuitive way to perform several tasks.
  6. pd.factorize + np.bincount
    • This is a highly advanced technique that is very obscure but is very fast. It cannot be used in all circumstances, but when it can be used and you are comfortable using it, you will reap the performance rewards.
  7. pd.get_dummies + pd.DataFrame.dot
    • I use this for cleverly performing cross tabulation.

Examples

What I’m going to do for each subsequent answer and question is to answer it using pd.DataFrame.pivot_table. Then I’ll provide alternatives to perform the same task.
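
The pivot examples below assume a long frame with columns key, row, item, col, val0 and val1. A minimal stand-in could be built like this (a sketch with illustrative values only; the original question's exact data is not reproduced here):

import numpy as np
import pandas as pd

# Illustrative stand-in only: same schema as the examples, made-up values
np.random.seed(0)
n = 20
df = pd.DataFrame({
    'key':  np.random.choice(['key0', 'key1', 'key2'], n),
    'row':  np.random.choice(['row0', 'row2', 'row3', 'row4'], n),
    'item': np.random.choice(['item0', 'item1', 'item2'], n),
    'col':  np.random.choice(['col0', 'col1', 'col2', 'col3', 'col4'], n),
    'val0': np.random.rand(n).round(2),
    'val1': np.random.rand(n).round(2),
})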

Question 3

How do I pivot df such that the col values are columns, row values are the index, mean of val0 are the values, and missing values are 0?

  • pd.DataFrame.pivot_table

    • fill_value is not set by default. I tend to set it appropriately. In this case I set it to 0. Notice I skipped question 2 as it’s the same as this answer without the fill_value
    • aggfunc='mean' is the default and I didn’t have to set it. I included it to be explicit.

      df.pivot_table(
          values='val0', index='row', columns='col',
          fill_value=0, aggfunc='mean')
      
      col   col0   col1   col2   col3  col4
      row                                  
      row0  0.77  0.605  0.000  0.860  0.65
      row2  0.13  0.000  0.395  0.500  0.25
      row3  0.00  0.310  0.000  0.545  0.00
      row4  0.00  0.100  0.395  0.760  0.24
      
  • pd.DataFrame.groupby

    df.groupby(['row', 'col'])['val0'].mean().unstack(fill_value=0)
    
  • pd.crosstab

    pd.crosstab(
        index=df['row'], columns=df['col'],
        values=df['val0'], aggfunc='mean').fillna(0)
    

Question 4

Can I get something other than mean, like maybe sum?

  • pd.DataFrame.pivot_table

    df.pivot_table(
        values='val0', index='row', columns='col',
        fill_value=0, aggfunc='sum')
    
    col   col0  col1  col2  col3  col4
    row                               
    row0  0.77  1.21  0.00  0.86  0.65
    row2  0.13  0.00  0.79  0.50  0.50
    row3  0.00  0.31  0.00  1.09  0.00
    row4  0.00  0.10  0.79  1.52  0.24
    
  • pd.DataFrame.groupby

    df.groupby(['row', 'col'])['val0'].sum().unstack(fill_value=0)
    
  • pd.crosstab

    pd.crosstab(
        index=df['row'], columns=df['col'],
        values=df['val0'], aggfunc='sum').fillna(0)
    

Question 5

Can I do more than one aggregation at a time?

Notice that for pivot_table and crosstab I needed to pass a list of callables. On the other hand, groupby.agg is able to take strings for a limited number of special functions. groupby.agg would also have taken the same callables we passed to the others, but it is often more efficient to leverage the string function names.

  • pd.DataFrame.pivot_table

    df.pivot_table(
        values='val0', index='row', columns='col',
        fill_value=0, aggfunc=[np.size, np.mean])
    
         size                      mean                           
    col  col0 col1 col2 col3 col4  col0   col1   col2   col3  col4
    row                                                           
    row0    1    2    0    1    1  0.77  0.605  0.000  0.860  0.65
    row2    1    0    2    1    2  0.13  0.000  0.395  0.500  0.25
    row3    0    1    0    2    0  0.00  0.310  0.000  0.545  0.00
    row4    0    1    2    2    1  0.00  0.100  0.395  0.760  0.24
    
  • pd.DataFrame.groupby

    df.groupby(['row', 'col'])['val0'].agg(['size', 'mean']).unstack(fill_value=0)
    
  • pd.crosstab

    pd.crosstab(
        index=df['row'], columns=df['col'],
        values=df['val0'], aggfunc=[np.size, np.mean]).fillna(0, downcast='infer')
    

Question 6

Can I aggregate over multiple value columns?

  • pd.DataFrame.pivot_table we pass values=['val0', 'val1'] but we could’ve left that off completely

    df.pivot_table(
        values=['val0', 'val1'], index='row', columns='col',
        fill_value=0, aggfunc='mean')
    
          val0                             val1                          
    col   col0   col1   col2   col3  col4  col0   col1  col2   col3  col4
    row                                                                  
    row0  0.77  0.605  0.000  0.860  0.65  0.01  0.745  0.00  0.010  0.02
    row2  0.13  0.000  0.395  0.500  0.25  0.45  0.000  0.34  0.440  0.79
    row3  0.00  0.310  0.000  0.545  0.00  0.00  0.230  0.00  0.075  0.00
    row4  0.00  0.100  0.395  0.760  0.24  0.00  0.070  0.42  0.300  0.46
    
  • pd.DataFrame.groupby

    df.groupby(['row', 'col'])['val0', 'val1'].mean().unstack(fill_value=0)
    

Question 7

Can I subdivide by multiple columns?

  • pd.DataFrame.pivot_table

    df.pivot_table(
        values='val0', index='row', columns=['item', 'col'],
        fill_value=0, aggfunc='mean')
    
    item item0             item1                         item2                   
    col   col2  col3  col4  col0  col1  col2  col3  col4  col0   col1  col3  col4
    row                                                                          
    row0  0.00  0.00  0.00  0.77  0.00  0.00  0.00  0.00  0.00  0.605  0.86  0.65
    row2  0.35  0.00  0.37  0.00  0.00  0.44  0.00  0.00  0.13  0.000  0.50  0.13
    row3  0.00  0.00  0.00  0.00  0.31  0.00  0.81  0.00  0.00  0.000  0.28  0.00
    row4  0.15  0.64  0.00  0.00  0.10  0.64  0.88  0.24  0.00  0.000  0.00  0.00
    
  • pd.DataFrame.groupby

    df.groupby(
        ['row', 'item', 'col']
    )['val0'].mean().unstack(['item', 'col']).fillna(0).sort_index(1)
    

Question 8

Can I subdivide by multiple columns?

  • pd.DataFrame.pivot_table

    df.pivot_table(
        values='val0', index=['key', 'row'], columns=['item', 'col'],
        fill_value=0, aggfunc='mean')
    
    item      item0             item1                         item2                  
    col        col2  col3  col4  col0  col1  col2  col3  col4  col0  col1  col3  col4
    key  row                                                                         
    key0 row0  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.86  0.00
         row2  0.00  0.00  0.37  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.50  0.00
         row3  0.00  0.00  0.00  0.00  0.31  0.00  0.81  0.00  0.00  0.00  0.00  0.00
         row4  0.15  0.64  0.00  0.00  0.00  0.00  0.00  0.24  0.00  0.00  0.00  0.00
    key1 row0  0.00  0.00  0.00  0.77  0.00  0.00  0.00  0.00  0.00  0.81  0.00  0.65
         row2  0.35  0.00  0.00  0.00  0.00  0.44  0.00  0.00  0.00  0.00  0.00  0.13
         row3  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.28  0.00
         row4  0.00  0.00  0.00  0.00  0.10  0.00  0.00  0.00  0.00  0.00  0.00  0.00
    key2 row0  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.40  0.00  0.00
         row2  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.13  0.00  0.00  0.00
         row4  0.00  0.00  0.00  0.00  0.00  0.64  0.88  0.00  0.00  0.00  0.00  0.00
    
  • pd.DataFrame.groupby

    df.groupby(
        ['key', 'row', 'item', 'col']
    )['val0'].mean().unstack(['item', 'col']).fillna(0).sort_index(1)
    
  • pd.DataFrame.set_index because the set of keys are unique for both rows and columns

    df.set_index(
        ['key', 'row', 'item', 'col']
    )['val0'].unstack(['item', 'col']).fillna(0).sort_index(1)
    

Question 9

Can I aggregate the frequency in which the column and rows occur together, aka “cross tabulation”?

  • pd.DataFrame.pivot_table

    df.pivot_table(index='row', columns='col', fill_value=0, aggfunc='size')
    
    col   col0  col1  col2  col3  col4
    row                               
    row0     1     2     0     1     1
    row2     1     0     2     1     2
    row3     0     1     0     2     0
    row4     0     1     2     2     1
    
  • pd.DataFrame.groupby

    df.groupby(['row', 'col'])['val0'].size().unstack(fill_value=0)
    
  • pd.crosstab

    pd.crosstab(df['row'], df['col'])
    
  • pd.factorize + np.bincount

    # get integer factorization `i` and unique values `r`
    # for column `'row'`
    i, r = pd.factorize(df['row'].values)
    # get integer factorization `j` and unique values `c`
    # for column `'col'`
    j, c = pd.factorize(df['col'].values)
    # `n` will be the number of rows
    # `m` will be the number of columns
    n, m = r.size, c.size
    # `i * m + j` is a clever way of counting the 
    # factorization bins assuming a flat array of length
    # `n * m`.  Which is why we subsequently reshape as `(n, m)`
    b = np.bincount(i * m + j, minlength=n * m).reshape(n, m)
    # BTW, whenever I read this, I think 'Bean, Rice, and Cheese'
    pd.DataFrame(b, r, c)
    
          col3  col2  col0  col1  col4
    row3     2     0     0     1     0
    row2     1     2     1     0     2
    row0     1     0     1     2     1
    row4     2     2     0     1     1
    
  • pd.get_dummies

    pd.get_dummies(df['row']).T.dot(pd.get_dummies(df['col']))
    
          col0  col1  col2  col3  col4
    row0     1     2     0     1     1
    row2     1     0     2     1     2
    row3     0     1     0     2     0
    row4     0     1     2     2     1
    

Question 10

How do I convert a DataFrame from long to wide by pivoting on ONLY two columns?
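
For reference, a minimal sketch of the kind of two-column frame assumed here, reconstructed from the output below; df2 is taken to be a working copy of df:

import pandas as pd

df = pd.DataFrame({'A': ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'c'],
                   'B': [0, 11, 2, 11, 10, 10, 14, 7]})
df2 = df.copy()  # work on a copy so the original frame stays intact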

The first step is to assign a number to each row – this number will be the row index of that value in the pivoted result. This is done using GroupBy.cumcount:

df2.insert(0, 'count', df.groupby('A').cumcount())
df2

   count  A   B
0      0  a   0
1      1  a  11
2      2  a   2
3      3  a  11
4      0  b  10
5      1  b  10
6      2  b  14
7      0  c   7

The second step is to use the newly created column as the index to call DataFrame.pivot.

df2.pivot(*df2)
# df2.pivot(index='count', columns='A', values='B')

A         a     b    c
count                 
0       0.0  10.0  7.0
1      11.0  10.0  NaN
2       2.0  14.0  NaN
3      11.0   NaN  NaN

Question 11

How do I flatten the MultiIndex to a single index after pivot?

If the columns are type object containing strings, join:

df.columns = df.columns.map('|'.join)

otherwise, use format:

df.columns = df.columns.map('{0[0]}|{0[1]}'.format) 
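
A quick sketch of the join variant on a small pivot (the frame and names here are illustrative):

import pandas as pd

df = pd.DataFrame({'row': ['r0', 'r0', 'r1'],
                   'col': ['c0', 'c1', 'c0'],
                   'val0': [1.0, 2.0, 3.0]})
wide = df.pivot_table(index='row', columns='col',
                      values='val0', aggfunc=['size', 'mean'])

# both levels are strings, so '|'.join maps each tuple to a flat label
wide.columns = wide.columns.map('|'.join)
print(list(wide.columns))  # ['size|c0', 'size|c1', 'mean|c0', 'mean|c1']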

回答 1

为扩展 @piRSquared 的答案,给出问题 10 的另一个版本

问题10.1

数据框:

d = data = {'A': {0: 1, 1: 1, 2: 1, 3: 2, 4: 2, 5: 3, 6: 5},
 'B': {0: 'a', 1: 'b', 2: 'c', 3: 'a', 4: 'b', 5: 'a', 6: 'c'}}
df = pd.DataFrame(d)

   A  B
0  1  a
1  1  b
2  1  c
3  2  a
4  2  b
5  3  a
6  5  c

输出:

   0     1     2
A
1  a     b     c
2  a     b  None
3  a  None  None
5  c  None  None

使用df.groupbypd.Series.tolist

t = df.groupby('A')['B'].apply(list)
out = pd.DataFrame(t.tolist(),index=t.index)
out
   0     1     2
A
1  a     b     c
2  a     b  None
3  a  None  None
5  c  None  None

或者使用更好的替代方法:pd.pivot_table 配合 df.squeeze。

t = df.pivot_table(index='A',values='B',aggfunc=list).squeeze()
out = pd.DataFrame(t.tolist(),index=t.index)

To extend @piRSquared’s answer another version of Question 10

Question 10.1

DataFrame:

d = data = {'A': {0: 1, 1: 1, 2: 1, 3: 2, 4: 2, 5: 3, 6: 5},
 'B': {0: 'a', 1: 'b', 2: 'c', 3: 'a', 4: 'b', 5: 'a', 6: 'c'}}
df = pd.DataFrame(d)

   A  B
0  1  a
1  1  b
2  1  c
3  2  a
4  2  b
5  3  a
6  5  c

Output:

   0     1     2
A
1  a     b     c
2  a     b  None
3  a  None  None
5  c  None  None

Using df.groupby and pd.Series.tolist

t = df.groupby('A')['B'].apply(list)
out = pd.DataFrame(t.tolist(),index=t.index)
out
   0     1     2
A
1  a     b     c
2  a     b  None
3  a  None  None
5  c  None  None

Or a much better alternative, using pd.pivot_table with df.squeeze:

t = df.pivot_table(index='A',values='B',aggfunc=list).squeeze()
out = pd.DataFrame(t.tolist(),index=t.index)
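
The .squeeze() is what turns the one-column pivot_table result back into a Series, so that .tolist() lines up with the groupby version; roughly:

t = df.pivot_table(index='A', values='B', aggfunc=list)
print(type(t))   # DataFrame with the single column 'B'
s = t.squeeze()
print(type(s))   # Series of lists, ready for .tolist()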

如何从数据框的单元格获取值?

问题:如何从数据框的单元格获取值?

我构造了一个条件,可以从我的数据帧中准确提取一行:

d2 = df[(df['l_ext']==l_ext) & (df['item']==item) & (df['wn']==wn) & (df['wd']==1)]

现在,我想从特定列中获取一个值:

val = d2['col_name']

但是结果是我得到一个包含一行一列(一个单元格)的数据帧。这不是我所需要的。我需要一个值(一个浮点数)。我该如何在熊猫中做到这一点?

I have constructed a condition that extract exactly one row from my data frame:

d2 = df[(df['l_ext']==l_ext) & (df['item']==item) & (df['wn']==wn) & (df['wd']==1)]

Now I would like to take a value from a particular column:

val = d2['col_name']

But as a result I get a data frame that contains one row and one column (i.e. one cell). It is not what I need. I need one value (one float number). How can I do it in pandas?


回答 0

如果您的DataFrame仅包含一行,则使用iloc,以Series的形式访问第一(唯一)行,然后使用列名访问值:

In [3]: sub_df
Out[3]:
          A         B
2 -0.133653 -0.030854

In [4]: sub_df.iloc[0]
Out[4]:
A   -0.133653
B   -0.030854
Name: 2, dtype: float64

In [5]: sub_df.iloc[0]['A']
Out[5]: -0.13365288513107493

If you have a DataFrame with only one row, then access the first (only) row as a Series using iloc, and then the value using the column name:

In [3]: sub_df
Out[3]:
          A         B
2 -0.133653 -0.030854

In [4]: sub_df.iloc[0]
Out[4]:
A   -0.133653
B   -0.030854
Name: 2, dtype: float64

In [5]: sub_df.iloc[0]['A']
Out[5]: -0.13365288513107493

回答 1

这些是标量的快速访问

In [15]: df = pandas.DataFrame(numpy.random.randn(5,3),columns=list('ABC'))

In [16]: df
Out[16]: 
          A         B         C
0 -0.074172 -0.090626  0.038272
1 -0.128545  0.762088 -0.714816
2  0.201498 -0.734963  0.558397
3  1.563307 -1.186415  0.848246
4  0.205171  0.962514  0.037709

In [17]: df.iat[0,0]
Out[17]: -0.074171888537611502

In [18]: df.at[0,'A']
Out[18]: -0.074171888537611502

These are fast access for scalars

In [15]: df = pandas.DataFrame(numpy.random.randn(5,3),columns=list('ABC'))

In [16]: df
Out[16]: 
          A         B         C
0 -0.074172 -0.090626  0.038272
1 -0.128545  0.762088 -0.714816
2  0.201498 -0.734963  0.558397
3  1.563307 -1.186415  0.848246
4  0.205171  0.962514  0.037709

In [17]: df.iat[0,0]
Out[17]: -0.074171888537611502

In [18]: df.at[0,'A']
Out[18]: -0.074171888537611502

回答 2

您可以将1×1数据帧转换为numpy数组,然后访问该数组的第一个也是唯一的值:

val = d2['col_name'].values[0]

You can turn your 1×1 dataframe into a numpy array, then access the first and only value of that array:

val = d2['col_name'].values[0]

回答 3

多数答案都在使用iloc,这对于按位置选择非常有用。

如果需要按标签选择 loc会更方便。

用于显式获取值(等价于已弃用的 df.get_value('a','A'))

# this is also equivalent to df1.at['a','A']
In [55]: df1.loc['a', 'A'] 
Out[55]: 0.13200317033032932

Most answers are using iloc which is good for selection by position.

If you need selection-by-label loc would be more convenient.

For getting a value explicitly (equivalent to the deprecated df.get_value('a','A')):

# this is also equivalent to df1.at['a','A']
In [55]: df1.loc['a', 'A'] 
Out[55]: 0.13200317033032932

回答 4

我需要一个由列和索引名称选择的单元格的值。此解决方案为我工作:

original_conversion_frequency.loc[1,:].values[0]

I needed the value of one cell, selected by column and index names. This solution worked for me:

original_conversion_frequency.loc[1,:].values[0]


回答 5

看起来 pandas 10.1/13.1 之后行为有了变化

我从 10.1 升级到了 13.1;在那之前 iloc 还不可用。

现在在 13.1 中,iloc[0]['label'] 得到的是单值数组而不是标量。

像这样:

lastprice=stock.iloc[-1]['Close']

输出:

date
2014-02-26 118.2
Name: Close, dtype: float64

It looks like changes after pandas 10.1/13.1

I upgraded from 10.1 to 13.1; before that, iloc was not available.

Now with 13.1, iloc[0]['label'] gets a single value array rather than a scalar.

Like this:

lastprice=stock.iloc[-1]['Close']

Output:

date
2014-02-26 118.2
Name: Close, dtype: float64

回答 6

我找到的最快/最简单的选项如下。501表示行索引。

df.at[501,'column_name']
df.get_value(501,'column_name')

The quickest/easiest options I have found are the following. 501 represents the row index.

df.at[501,'column_name']
df.get_value(501,'column_name')
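
Note that get_value was later deprecated and then removed, so on current pandas only the accessor forms survive. For example:

df.at[501, 'column_name']   # label-based scalar access (replaces get_value)
df.iat[0, 3]                # positional counterpart; row 0, column 3 are illustrative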

回答 7

对于 pandas 0.10(当时还没有 iloc),先过滤 DataFrame,再获取 VALUE 列第一行的数据:

df_filt = df[(df['C1'] == C1val) & (df['C2'] == C2val)]
result = df_filt.get_value(df_filt.index[0],'VALUE')

如果过滤的行数超过1,则获取第一行的值。如果过滤器导致数据帧为空,则会出现异常。

For pandas 0.10, where iloc is unavalable, filter a DF and get the first row data for the column VALUE:

df_filt = df[(df['C1'] == C1val) & (df['C2'] == C2val)]
result = df_filt.get_value(df_filt.index[0],'VALUE')

If more than one row is filtered, this obtains the first row's value. There will be an exception if the filter results in an empty data frame.


回答 8

不知道这是否是好习惯,但我注意到也可以通过把序列强制转换为 float 来获得值。

例如

rate

3 0.042679

Name: Unemployment_rate, dtype: float64

float(rate)

0.0426789

Not sure if this is a good practice, but I noticed I can also get just the value by casting the series as float.

e.g.

rate

3 0.042679

Name: Unemployment_rate, dtype: float64

float(rate)

0.0426789
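
Newer pandas versions warn about casting a single-element Series with float(); Series.item() is the explicit way to pull the scalar out:

val = rate.item()   # explicit scalar extraction, same value as float(rate)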


回答 9

它不需要很复杂:

val = df.loc[df.wd==1, 'col_name'].values[0]

It doesn’t need to be complicated:

val = df.loc[df.wd==1, 'col_name'].values[0]

回答 10

df_gdp.columns

Index([u'Country', u'Country Code', u'Indicator Name', u'Indicator Code', u'1960', u'1961', u'1962', u'1963', u'1964', u'1965', u'1966', u'1967', u'1968', u'1969', u'1970', u'1971', u'1972', u'1973', u'1974', u'1975', u'1976', u'1977', u'1978', u'1979', u'1980', u'1981', u'1982', u'1983', u'1984', u'1985', u'1986', u'1987', u'1988', u'1989', u'1990', u'1991', u'1992', u'1993', u'1994', u'1995', u'1996', u'1997', u'1998', u'1999', u'2000', u'2001', u'2002', u'2003', u'2004', u'2005', u'2006', u'2007', u'2008', u'2009', u'2010', u'2011', u'2012', u'2013', u'2014', u'2015', u'2016'], dtype='object')

df_gdp[df_gdp["Country Code"] == "USA"]["1996"].values[0]

8100000000000.0

df_gdp.columns

Index([u'Country', u'Country Code', u'Indicator Name', u'Indicator Code', u'1960', u'1961', u'1962', u'1963', u'1964', u'1965', u'1966', u'1967', u'1968', u'1969', u'1970', u'1971', u'1972', u'1973', u'1974', u'1975', u'1976', u'1977', u'1978', u'1979', u'1980', u'1981', u'1982', u'1983', u'1984', u'1985', u'1986', u'1987', u'1988', u'1989', u'1990', u'1991', u'1992', u'1993', u'1994', u'1995', u'1996', u'1997', u'1998', u'1999', u'2000', u'2001', u'2002', u'2003', u'2004', u'2005', u'2006', u'2007', u'2008', u'2009', u'2010', u'2011', u'2012', u'2013', u'2014', u'2015', u'2016'], dtype='object')

df_gdp[df_gdp["Country Code"] == "USA"]["1996"].values[0]

8100000000000.0


Python Pandas错误标记数据

问题:Python Pandas错误标记数据

我正在尝试使用熊猫来操作.csv文件,但出现此错误:

pandas.parser.CParserError:标记数据时出错。C错误:第3行中应有2个字段,看到了12

我试图阅读熊猫文档,但一无所获。

我的代码很简单:

path = 'GOOG Key Ratios.csv'
#print(open(path).read())
data = pd.read_csv(path)

我该如何解决?我应该使用csv模块还是其他语言?

文件来自Morningstar

I’m trying to use pandas to manipulate a .csv file but I get this error:

pandas.parser.CParserError: Error tokenizing data. C error: Expected 2 fields in line 3, saw 12

I have tried to read the pandas docs, but found nothing.

My code is simple:

path = 'GOOG Key Ratios.csv'
#print(open(path).read())
data = pd.read_csv(path)

How can I resolve this? Should I use the csv module or another language ?

File is from Morningstar


回答 0

您也可以尝试;

data = pd.read_csv('file1.csv', error_bad_lines=False)

请注意,这将导致违规行被跳过。

you could also try;

data = pd.read_csv('file1.csv', error_bad_lines=False)

Do note that this will cause the offending lines to be skipped.
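
For reference, error_bad_lines was deprecated in pandas 1.3 in favour of on_bad_lines; the equivalent call on newer versions would be:

import pandas as pd

# pandas >= 1.3: skip malformed rows instead of raising
data = pd.read_csv('file1.csv', on_bad_lines='skip')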


回答 1

问题可能出在

  • 数据中的分隔符
  • 第一行,如@TomAugspurger指出

要解决此问题,请尝试在调用 read_csv 时指定 sep 和/或 header 参数。例如,

df = pandas.read_csv(fileName, sep='delimiter', header=None)

在上面的代码中,sep 定义你的定界符,header=None 告诉 pandas 源数据中没有标题/列名行。正如文档所说:"如果文件不包含标题行,则应显式传递 header=None"。在这种情况下,pandas 会自动为每个字段创建整数索引 {0,1,2,…}。

根据文档,定界符本不应该成为问题。文档说:"如果 sep 为 None(未指定),将尝试自动确定分隔符。"但我在这方面一直运气不佳,包括分隔符很明显的情况。

It might be an issue with

  • the delimiters in your data
  • the first row, as @TomAugspurger noted

To solve it, try specifying the sep and/or header arguments when calling read_csv. For instance,

df = pandas.read_csv(fileName, sep='delimiter', header=None)

In the code above, sep defines your delimiter and header=None tells pandas that your source data has no row for headers / column titles. Thus saith the docs: “If file contains no header row, then you should explicitly pass header=None”. In this instance, pandas automatically creates whole-number indices for each field {0,1,2,…}.

According to the docs, the delimiter thing should not be an issue. The docs say that “if sep is None [not specified], will try to automatically determine this.” I however have not had good luck with this, including instances with obvious delimiters.


回答 2

解析器被文件的标题弄糊涂了。它读取第一行并推断该行的列数。但是前两行并不代表文件中的实际数据。

试试看 data = pd.read_csv(path, skiprows=2)

The parser is getting confused by the header of the file. It reads the first row and infers the number of columns from that row. But the first two rows aren’t representative of the actual data in the file.

Try it with data = pd.read_csv(path, skiprows=2)


回答 3

您的 CSV 文件可能列数不定,而 read_csv 只从前几行推断列数。这种情况下有两种解决方法:

1)修改 CSV 文件,在开头加一个具有最大列数的虚拟行(并指定 header=[0]);

2)或者使用 names = list(range(0,N)),其中 N 是最大列数。

Your CSV file might have variable number of columns and read_csv inferred the number of columns from the first few rows. Two ways to solve it in this case:

1) Change the CSV file to have a dummy first line with max number of columns (and specify header=[0])

2) Or use names = list(range(0,N)) where N is the max number of columns.


回答 4

这很可能是定界符的问题,因为很多 CSV 文件是用制表符分隔创建的,所以请尝试让 read_csv 使用制表符(\t)作为分隔符。试着用下面这行代码打开:

data=pd.read_csv("File_path", sep='\t')

This is likely an issue with the delimiter, as many CSV files are created with a tab separator, so try read_csv with the tab character (\t) as the separator. So, try to open it using the following line of code:

data=pd.read_csv("File_path", sep='\t')

回答 5

我也遇到过这个问题,但原因也许不同。我的 CSV 里有一些尾随逗号,多出了 pandas 试图读取的额外列。下面的方法可行,但它只是忽略了那些坏行:

data = pd.read_csv('file1.csv', error_bad_lines=False)

如果要保留这些行以处理错误,请执行以下操作:

line     = []
expected = []
saw      = []     
cont     = True 

while cont == True:     
    try:
        data = pd.read_csv('file1.csv',skiprows=line)
        cont = False
    except Exception as e:    
        errortype = e.message.split('.')[0].strip()                                
        if errortype == 'Error tokenizing data':                        
           cerror      = e.message.split(':')[1].strip().replace(',','')
           nums        = [n for n in cerror.split(' ') if str.isdigit(n)]
           expected.append(int(nums[0]))
           saw.append(int(nums[2]))
           line.append(int(nums[1])-1)
        else:
           cerror      = 'Unknown'
           print 'Unknown Error - 222'

if line != []:
    pass  # Handle the errors however you want

我继续编写脚本以将这些行重新插入到DataFrame中,因为不良行将由上述代码中的变量“ line”给出。只需使用csv阅读器,就可以避免所有这些情况。希望熊猫开发者将来可以使处理这种情况更加容易。

I had this problem as well but perhaps for a different reason. I had some trailing commas in my CSV that were adding an additional column that pandas was attempting to read. Using the following works but it simply ignores the bad lines:

data = pd.read_csv('file1.csv', error_bad_lines=False)

If you want to keep the lines, an ugly kind of hack for handling the errors is to do something like the following:

line     = []
expected = []
saw      = []     
cont     = True 

while cont == True:     
    try:
        data = pd.read_csv('file1.csv',skiprows=line)
        cont = False
    except Exception as e:    
        errortype = e.message.split('.')[0].strip()                                
        if errortype == 'Error tokenizing data':                        
           cerror      = e.message.split(':')[1].strip().replace(',','')
           nums        = [n for n in cerror.split(' ') if str.isdigit(n)]
           expected.append(int(nums[0]))
           saw.append(int(nums[2]))
           line.append(int(nums[1])-1)
        else:
           cerror      = 'Unknown'
           print 'Unknown Error - 222'

if line != []:
    pass  # Handle the errors however you want

I proceeded to write a script to reinsert the lines into the DataFrame since the bad lines will be given by the variable ‘line’ in the above code. This can all be avoided by simply using the csv reader. Hopefully the pandas developers can make it easier to deal with this situation in the future.
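
On newer pandas (1.4+) the same keep-the-bad-lines idea can be expressed less painfully with a callable on_bad_lines handler (python engine only); a sketch:

import pandas as pd

bad_rows = []

def stash(fields):
    # called once per malformed line with its list of parsed fields;
    # returning None drops the line from the parsed result
    bad_rows.append(fields)
    return None

data = pd.read_csv('file1.csv', engine='python', on_bad_lines=stash)
# bad_rows now holds every malformed row for inspection or reinsertion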


回答 6

我遇到了这个问题,我试图在不传递列名的情况下读取CSV文件。

df = pd.read_csv(filename, header=None)

我事先在列表中指定了列名称,然后把它们传给 names,问题立刻就解决了。如果您没有固定的列名,可以创建与数据中可能的最大列数一样多的占位符名称。

col_names = ["col1", "col2", "col3", ...]
df = pd.read_csv(filename, names=col_names)

I had this problem, where I was trying to read in a CSV without passing in column names.

df = pd.read_csv(filename, header=None)

I specified the column names in a list beforehand and then passed them into names, and it solved it immediately. If you don't have column names set, you could just create as many placeholder names as the maximum number of columns that might be in your data.

col_names = ["col1", "col2", "col3", ...]
df = pd.read_csv(filename, names=col_names)

回答 7

我自己也遇到过几次这个问题。几乎每次,原因都是我试图打开的文件一开始就不是正确保存的 CSV。所谓"正确",是指每行具有相同数量的分隔符或列。

通常发生这种情况是因为我在Excel中打开了CSV,然后错误地保存了它。即使文件扩展名仍然是.csv,纯CSV格式也已更改。

用pandas to_csv保存的所有文件都将正确格式化,并且不会出现该问题。但是,如果您使用其他程序打开它,则可能会更改结构。

希望能有所帮助。

I’ve had this problem a few times myself. Almost every time, the reason is that the file I was attempting to open was not a properly saved CSV to begin with. And by “properly”, I mean each row had the same number of separators or columns.

Typically it happened because I had opened the CSV in Excel then improperly saved it. Even though the file extension was still .csv, the pure CSV format had been altered.

Any file saved with pandas to_csv will be properly formatted and shouldn’t have that issue. But if you open it with another program, it may change the structure.

Hope that helps.


回答 8

我遇到了同样的问题。对同一个源文件改用 pd.read_table() 似乎可行。我没能查出原因,但对我的情况来说这是个有用的变通办法。也许更内行的人能解释它为什么有效。

编辑:我发现,当文件中有一些文本与实际数据的格式不一致时,就会出现这个错误。这通常是页眉或页脚信息(多于一行,所以 skip_header 不起作用),它们不会像实际数据那样被相同数量的逗号分隔(使用 read_csv 时)。使用 read_table 则以制表符作为分隔符,可以绕过当前的错误,但可能引入其他错误。

我通常通过将多余的数据读取到文件中然后使用read_csv()方法来解决此问题。

确切的解决方案可能会有所不同,具体取决于您的实际文件,但是这种方法在某些情况下对我有用

I came across the same issue. Using pd.read_table() on the same source file seemed to work. I could not trace the reason for this but it was a useful workaround for my case. Perhaps someone more knowledgeable can shed more light on why it worked.

Edit: I found that this error creeps up when you have some text in your file that does not have the same format as the actual data. This is usually header or footer information (greater than one line, so skip_header doesn’t work) which will not be separated by the same number of commas as your actual data (when using read_csv). Using read_table uses a tab as the delimiter which could circumvent the users current error but introduce others.

I usually get around this by reading the extra data into a file then use the read_csv() method.

The exact solution might differ depending on your actual file, but this approach has worked for me in several cases


回答 9

以下代码对我有用(我发布了此答案,因为我特别在Google合作笔记本中遇到了此问题):

df = pd.read_csv("/path/foo.csv", delimiter=';', skiprows=0, low_memory=False)

The following worked for me (I posted this answer, because I specifically had this problem in a Google Colaboratory Notebook):

df = pd.read_csv("/path/foo.csv", delimiter=';', skiprows=0, low_memory=False)

回答 10

尝试读取带有空格,逗号和引号的制表符分隔表时,我遇到了类似的问题:

1115794 4218    "k__Bacteria", "p__Firmicutes", "c__Bacilli", "o__Bacillales", "f__Bacillaceae", ""
1144102 3180    "k__Bacteria", "p__Firmicutes", "c__Bacilli", "o__Bacillales", "f__Bacillaceae", "g__Bacillus", ""
368444  2328    "k__Bacteria", "p__Bacteroidetes", "c__Bacteroidia", "o__Bacteroidales", "f__Bacteroidaceae", "g__Bacteroides", ""



import pandas as pd
# Same error for read_table
counts = pd.read_csv(path_counts, sep='\t', index_col=2, header=None, engine = 'c')

pandas.io.common.CParserError: Error tokenizing data. C error: out of memory

这说明问题与 C 解析引擎(默认引擎)有关。也许换成 python 引擎会有所不同:

counts = pd.read_table(path_counts, sep='\t', index_col=2, header=None, engine='python')

Segmentation fault (core dumped)

现在,这是一个不同的错误。
如果我们继续尝试从表中删除空格,则python-engine的错误再次更改:

1115794 4218    "k__Bacteria","p__Firmicutes","c__Bacilli","o__Bacillales","f__Bacillaceae",""
1144102 3180    "k__Bacteria","p__Firmicutes","c__Bacilli","o__Bacillales","f__Bacillaceae","g__Bacillus",""
368444  2328    "k__Bacteria","p__Bacteroidetes","c__Bacteroidia","o__Bacteroidales","f__Bacteroidaceae","g__Bacteroides",""


_csv.Error: '   ' expected after '"'

很明显,pandas 在解析我们的行时遇到了问题。要用 python 引擎解析这张表,我需要事先删除表中的所有空格和引号。与此同时,即使行里只剩逗号,C 引擎也照样崩溃。

为了避免创建带有替换的新文件,我这样做是因为表很小:

from io import StringIO
with open(path_counts) as f:
    input = StringIO(f.read().replace('", ""', '').replace('"', '').replace(', ', ',').replace('\0',''))
    counts = pd.read_table(input, sep='\t', index_col=2, header=None, engine='python')

tl;dr
更换解析引擎,并尽量避免数据中出现任何非定界用途的引号/逗号/空格。

I’ve had a similar problem while trying to read a tab-delimited table with spaces, commas and quotes:

1115794 4218    "k__Bacteria", "p__Firmicutes", "c__Bacilli", "o__Bacillales", "f__Bacillaceae", ""
1144102 3180    "k__Bacteria", "p__Firmicutes", "c__Bacilli", "o__Bacillales", "f__Bacillaceae", "g__Bacillus", ""
368444  2328    "k__Bacteria", "p__Bacteroidetes", "c__Bacteroidia", "o__Bacteroidales", "f__Bacteroidaceae", "g__Bacteroides", ""



import pandas as pd
# Same error for read_table
counts = pd.read_csv(path_counts, sep='\t', index_col=2, header=None, engine = 'c')

pandas.io.common.CParserError: Error tokenizing data. C error: out of memory

This says it has something to do with C parsing engine (which is the default one). Maybe changing to a python one will change anything

counts = pd.read_table(path_counts, sep='\t', index_col=2, header=None, engine='python')

Segmentation fault (core dumped)

Now that is a different error.
If we go ahead and try to remove spaces from the table, the error from python-engine changes once again:

1115794 4218    "k__Bacteria","p__Firmicutes","c__Bacilli","o__Bacillales","f__Bacillaceae",""
1144102 3180    "k__Bacteria","p__Firmicutes","c__Bacilli","o__Bacillales","f__Bacillaceae","g__Bacillus",""
368444  2328    "k__Bacteria","p__Bacteroidetes","c__Bacteroidia","o__Bacteroidales","f__Bacteroidaceae","g__Bacteroides",""


_csv.Error: '   ' expected after '"'

And it gets clear that pandas was having problems parsing our rows. To parse a table with python engine I needed to remove all spaces and quotes from the table beforehand. Meanwhile C-engine kept crashing even with commas in rows.

To avoid creating a new file with replacements I did this, as my tables are small:

from io import StringIO
with open(path_counts) as f:
    input = StringIO(f.read().replace('", ""', '').replace('"', '').replace(', ', ',').replace('\0',''))
    counts = pd.read_table(input, sep='\t', index_col=2, header=None, engine='python')

tl;dr
Change parsing engine, try to avoid any non-delimiting quotes/commas/spaces in your data.


回答 11

我使用的数据集有很多格式之外多余的引号(")。通过给 read_csv() 加上以下参数,我解决了这个错误:

quoting=3 # 3 correlates to csv.QUOTE_NONE for pandas

The dataset that I used had a lot of quote marks (“) used extraneous of the formatting. I was able to fix the error by including this parameter for read_csv():

quoting=3 # 3 correlates to csv.QUOTE_NONE for pandas
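
The magic number maps to a csv module constant, which reads more clearly:

import csv
import pandas as pd

df = pd.read_csv('data.csv', quoting=csv.QUOTE_NONE)  # same as quoting=3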

回答 12

在参数中使用定界符

pd.read_csv(filename, delimiter=",", encoding='utf-8')

它会读取。

Use delimiter in parameter

pd.read_csv(filename, delimiter=",", encoding='utf-8')

It will read.


回答 13

尽管本问题并非这种情况,但压缩数据也可能出现此错误。显式设置 compression 关键字参数解决了我的问题。

result = pandas.read_csv(data_source, compression='gzip')

Although not the case for this question, this error may also appear with compressed data. Explicitly setting the value for kwarg compression resolved my problem.

result = pandas.read_csv(data_source, compression='gzip')

回答 14

我发现对处理类似的解析错误有用的另一种方法是使用CSV模块将数据重新路由到pandas df中。例如:

import csv
import pandas as pd
path = 'C:/FileLocation/'
file = 'filename.csv'
f = open(path+file,'rt')
reader = csv.reader(f)

#once contents are available, I then put them in a list
csv_list = []
for l in reader:
    csv_list.append(l)
f.close()
#now pandas has no problem getting into a df
df = pd.DataFrame(csv_list)

我发现CSV模块对于格式较差的逗号分隔文件更加健壮,因此在解决此类问题方面,此方法已取得成功。

An alternative that I have found to be useful in dealing with similar parsing errors uses the CSV module to re-route data into a pandas df. For example:

import csv
import pandas as pd
path = 'C:/FileLocation/'
file = 'filename.csv'
f = open(path+file,'rt')
reader = csv.reader(f)

#once contents are available, I then put them in a list
csv_list = []
for l in reader:
    csv_list.append(l)
f.close()
#now pandas has no problem getting into a df
df = pd.DataFrame(csv_list)

I find the CSV module to be a bit more robust to poorly formatted comma separated files and so have had success with this route to address issues like these.


回答 15

以下命令序列有效(由于没有 header=None,我丢掉了数据的第一行,但至少加载成功了):

df = pd.read_csv(filename, usecols=range(0, 42))
df.columns = ['YR', 'MO', 'DAY', 'HR', 'MIN', 'SEC', 'HUND', 'ERROR', 'RECTYPE', 'LANE', 'SPEED', 'CLASS', 'LENGTH', 'GVW', 'ESAL', 'W1', 'S1', 'W2', 'S2', 'W3', 'S3', 'W4', 'S4', 'W5', 'S5', 'W6', 'S6', 'W7', 'S7', 'W8', 'S8', 'W9', 'S9', 'W10', 'S10', 'W11', 'S11', 'W12', 'S12', 'W13', 'S13', 'W14']

以下操作无效:

df = pd.read_csv(filename, names=['YR', 'MO', 'DAY', 'HR', 'MIN', 'SEC', 'HUND', 'ERROR', 'RECTYPE', 'LANE', 'SPEED', 'CLASS', 'LENGTH', 'GVW', 'ESAL', 'W1', 'S1', 'W2', 'S2', 'W3', 'S3', 'W4', 'S4', 'W5', 'S5', 'W6', 'S6', 'W7', 'S7', 'W8', 'S8', 'W9', 'S9', 'W10', 'S10', 'W11', 'S11', 'W12', 'S12', 'W13', 'S13', 'W14'], usecols=range(0, 42))

CParserError:标记数据时出错。C错误:在1605634行中应有53个字段,看到了54个

以下内容也无效:

df = pd.read_csv(filename, header=None)

CParserError:标记数据时出错。C错误:在1605634行中预期有53个字段,看到了54

因此,对于你的问题,你必须传递 usecols=range(0, 2)。

The following sequence of commands works (I lose the first line of the data since no header=None is present, but at least it loads):

df = pd.read_csv(filename, usecols=range(0, 42))
df.columns = ['YR', 'MO', 'DAY', 'HR', 'MIN', 'SEC', 'HUND', 'ERROR', 'RECTYPE', 'LANE', 'SPEED', 'CLASS', 'LENGTH', 'GVW', 'ESAL', 'W1', 'S1', 'W2', 'S2', 'W3', 'S3', 'W4', 'S4', 'W5', 'S5', 'W6', 'S6', 'W7', 'S7', 'W8', 'S8', 'W9', 'S9', 'W10', 'S10', 'W11', 'S11', 'W12', 'S12', 'W13', 'S13', 'W14']

Following does NOT work:

df = pd.read_csv(filename, names=['YR', 'MO', 'DAY', 'HR', 'MIN', 'SEC', 'HUND', 'ERROR', 'RECTYPE', 'LANE', 'SPEED', 'CLASS', 'LENGTH', 'GVW', 'ESAL', 'W1', 'S1', 'W2', 'S2', 'W3', 'S3', 'W4', 'S4', 'W5', 'S5', 'W6', 'S6', 'W7', 'S7', 'W8', 'S8', 'W9', 'S9', 'W10', 'S10', 'W11', 'S11', 'W12', 'S12', 'W13', 'S13', 'W14'], usecols=range(0, 42))

CParserError: Error tokenizing data. C error: Expected 53 fields in line 1605634, saw 54

The following does NOT work either:

df = pd.read_csv(filename, header=None)

CParserError: Error tokenizing data. C error: Expected 53 fields in line 1605634, saw 54

Hence, in your problem you have to pass usecols=range(0, 2)


回答 16

对于那些在Linux OS上使用Python 3遇到类似问题的人。

pandas.errors.ParserError: Error tokenizing data. C error: Calling
read(nbytes) on source failed. Try engine='python'.

尝试:

df = pd.read_csv('file.csv', encoding='utf8', engine='python')

For those who are having similar issue with Python 3 on linux OS.

pandas.errors.ParserError: Error tokenizing data. C error: Calling
read(nbytes) on source failed. Try engine='python'.

Try:

df = pd.read_csv('file.csv', encoding='utf8', engine='python')

回答 17

有时问题不在于如何使用python,而在于原始数据。
我收到此错误消息

Error tokenizing data. C error: Expected 18 fields in line 72, saw 19.

事实证明,在列说明中有时会出现逗号。这意味着需要清理CSV文件或使用其他分隔符。

Sometimes the problem is not how to use python, but with the raw data.
I got this error message

Error tokenizing data. C error: Expected 18 fields in line 72, saw 19.

It turned out that in the column description there were sometimes commas. This means that the CSV file needs to be cleaned up or another separator used.


回答 18

采用 pandas.read_csv('CSVFILENAME',header=None,sep=', ')

尝试从链接读取csv数据时

http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data

我把数据从该网站复制到了我的 csv 文件里。数据里有多余的空格,所以用 sep=', ' 就成功了 :)

use pandas.read_csv('CSVFILENAME',header=None,sep=', ')

when trying to read csv data from the link

http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data

I copied the data from the site into my csv file. It had extra spaces, so I used sep=', ' and it worked :)


回答 19

我有一个已经带有行号的数据集,于是使用了 index_col:

pd.read_csv('train.csv', index_col=0)

I had a dataset with preexisting row numbers, so I used index_col:

pd.read_csv('train.csv', index_col=0)

回答 20

这就是我所做的。

sep='::' 解决了我的问题:

data=pd.read_csv('C:\\Users\\HP\\Downloads\\NPL ASSINGMENT 2 imdb_labelled\\imdb_labelled.txt',engine='python',header=None,sep='::')

This is what I did.

sep='::' solved my issue:

data=pd.read_csv('C:\\Users\\HP\\Downloads\\NPL ASSINGMENT 2 imdb_labelled\\imdb_labelled.txt',engine='python',header=None,sep='::')

回答 21

我遇到过与此类似的情况,设置

train = pd.read_csv('input.csv' , encoding='latin1',engine='python') 

之后就可以了

I had a similar case as this and setting

train = pd.read_csv('input.csv' , encoding='latin1',engine='python') 

worked


回答 22

当read_csv时,我有同样的问题:ParserError:标记数据时出错。我只是将旧的csv文件保存到新的csv文件中。问题已经解决了!

I have the same problem when read_csv: ParserError: Error tokenizing data. I just saved the old csv file to a new csv file. The problem is solved!


回答 23

对我来说,问题在于盘中有一个新列被追加到了我的 CSV。如果使用 error_bad_lines=False,之后的每一行都会被丢弃,所以已接受答案的方案行不通。

这种情况下的解决方案是使用 pd.read_csv() 的 usecols 参数。这样我可以只指定需要从 CSV 读入的列;只要标题行还在(且列名不变),我的 Python 代码就能适应将来 CSV 的变化。

usecols : list-like or callable, optional 

Return a subset of the columns. If list-like, all elements must either
be positional (i.e. integer indices into the document columns) or
strings that correspond to column names provided either by the user in
names or inferred from the document header row(s). For example, a
valid list-like usecols parameter would be [0, 1, 2] or ['foo', 'bar',
'baz']. Element order is ignored, so usecols=[0, 1] is the same as [1,
0]. To instantiate a DataFrame from data with element order preserved
use pd.read_csv(data, usecols=['foo', 'bar'])[['foo', 'bar']] for
columns in ['foo', 'bar'] order or pd.read_csv(data, usecols=['foo',
'bar'])[['bar', 'foo']] for ['bar', 'foo'] order.

my_columns = ['foo', 'bar', 'bob']
df = pd.read_csv(file_path, usecols=my_columns)

这样做的另一个好处是,如果我仅使用3-4列的CSV(具有18-20列),则可以将较少的数据加载到内存中。

The issue for me was that a new column was appended to my CSV intraday. The accepted answer solution would not work as every future row would be discarded if I used error_bad_lines=False.

The solution in this case was to use the usecols parameter in pd.read_csv(). This way I can specify only the columns that I need to read into the CSV and my Python code will remain resilient to future CSV changes so long as a header column exists (and the column names do not change).

usecols : list-like or callable, optional 

Return a subset of the columns. If list-like, all elements must either
be positional (i.e. integer indices into the document columns) or
strings that correspond to column names provided either by the user in
names or inferred from the document header row(s). For example, a
valid list-like usecols parameter would be [0, 1, 2] or ['foo', 'bar',
'baz']. Element order is ignored, so usecols=[0, 1] is the same as [1,
0]. To instantiate a DataFrame from data with element order preserved
use pd.read_csv(data, usecols=['foo', 'bar'])[['foo', 'bar']] for
columns in ['foo', 'bar'] order or pd.read_csv(data, usecols=['foo',
'bar'])[['bar', 'foo']] for ['bar', 'foo'] order.

Example

my_columns = ['foo', 'bar', 'bob']
df = pd.read_csv(file_path, usecols=my_columns)

Another benefit of this is that I can load way less data into memory if I am only using 3-4 columns of a CSV that has 18-20 columns.


回答 24

简单的解决方法:在 Excel 中打开 csv 文件,以 csv 格式换个文件名另存。然后再次尝试在 Spyder 中导入,问题就解决了!

Simple resolution: open the csv file in Excel and save it under a different file name in csv format. Then try importing it in Spyder again; your problem will be resolved!


回答 25

我遇到了带有引号的错误。我使用映射软件,在导出逗号分隔文件时,该软件会在文本项周围加上引号。当使用引号(例如’=英尺和“ =英寸)时,如果引起定界符冲突,则可能会出现问题。请考虑以下示例,该示例指出5英寸的测井记录质量较差:

UWI_key,Latitude,Longitude,Remark
US42051316890000,30.4386484,-96.4330734,"poor 5""

5"速记的5 inch方式结束了在工程扔扳手。Excel会简单地删除多余的引号,但Pandas会崩溃而没有error_bad_lines=False上面提到的参数。

I have encountered this error with a stray quotation mark. I use mapping software which will put quotation marks around text items when exporting comma-delimited files. Text which uses quote marks (e.g. ‘ = feet and ” = inches) can be problematic when then induce delimiter collisions. Consider this example which notes that a 5-inch well log print is poor:

UWI_key,Latitude,Longitude,Remark
US42051316890000,30.4386484,-96.4330734,"poor 5""

Using 5" as shorthand for 5 inch ends up throwing a wrench in the works. Excel will simply strip off the extra quote mark, but Pandas breaks down without the error_bad_lines=False argument mentioned above.


回答 26

据我所知,在查看文件后,问题在于您要加载的csv文件具有多个表。有空行或包含表标题的行。尝试看看这个Stackoverflow答案。它显示了如何以编程方式实现这一目标。

做到这一点的另一种动态方法是使用csv模块,一次读取每一行并进行完整性检查/正则表达式,以推断该行是否为(title / header / values / blank)。使用此方法还有一个优势,即可以根据需要在python对象中拆分/追加/收集数据。

最简单的办法是:如果你能在 Excel 之类的程序里打开这个 CSV,就手动选中表格复制到剪贴板,然后使用 pandas 的 pd.read_clipboard() 功能。

不相关的

此外,这与你的问题无关,但因为没有人提到:从 UCI 加载某些数据集(例如 seeds_dataset.txt)时,我遇到了同样的问题。在我的情况下,出错是因为某些分隔符包含的空白比真正的制表符 \t 更多。例如,请看下面的第 3 行:

14.38   14.21   0.8951  5.386   3.312   2.462   4.956   1
14.69   14.49   0.8799  5.563   3.259   3.586   5.219   1
14.11   14.1    0.8911  5.42    3.302   2.7     5       1

因此,请使用\t+分隔符样式代替\t

data = pd.read_csv(path, sep='\t+', header=None)

As far as I can tell, and after taking a look at your file, the problem is that the csv file you’re trying to load has multiple tables. There are empty lines, or lines that contain table titles. Try to have a look at this Stackoverflow answer. It shows how to achieve that programmatically.

Another dynamic approach to do that would be to use the csv module, read every single row at a time and make sanity checks/regular expressions, to infer if the row is (title/header/values/blank). You have one more advantage with this approach, that you can split/append/collect your data in python objects as desired.
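
A minimal sketch of that row-classification idea (the blank/title heuristics and the file name are assumptions about your data):

import csv
import re
import pandas as pd

rows = []
with open('multi_table.csv', newline='') as f:        # hypothetical file
    for row in csv.reader(f):
        if not any(cell.strip() for cell in row):
            continue                                  # blank separator line
        if len(row) == 1 or re.match(r'^Table\b', row[0]):
            continue                                  # a table-title line
        rows.append(row)                              # header or data row

df = pd.DataFrame(rows[1:], columns=rows[0])          # first kept row is the header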

The easiest of all would be to use pandas function pd.read_clipboard() after manually selecting and copying the table to the clipboard, in case you can open the csv in excel or something.

Irrelevant:

Additionally, irrelevant to your problem, but because no one made mention of this: I had this same issue when loading some datasets such as seeds_dataset.txt from UCI. In my case, the error was occurring because some separators had more whitespaces than a true tab \t. See line 3 in the following for instance

14.38   14.21   0.8951  5.386   3.312   2.462   4.956   1
14.69   14.49   0.8799  5.563   3.259   3.586   5.219   1
14.11   14.1    0.8911  5.42    3.302   2.7     5       1

Therefore, use \t+ in the separator pattern instead of \t.

data = pd.read_csv(path, sep='\t+', header=None)

回答 27

就我而言,这是因为csv文件的第一行和最后两行的格式与文件的中间内容不同。

因此,我要做的是将csv文件作为字符串打开,解析字符串的内容,然后用于read_csv获取数据框。

from io import StringIO
import pandas as pd

file = open(f'{file_path}/{file_name}', 'r')
content = file.read()

# change new line character from '\r\n' to '\n'
lines = content.replace('\r', '').split('\n')

# Remove the first and last 2 lines of the file
# StringIO can be considered as a file stored in memory
df = pd.read_csv(StringIO("\n".join(lines[2:-2])), header=None)

In my case, it is because the format of the first and last two lines of the csv file is different from the middle content of the file.

So what I do is open the csv file as a string, parse the content of the string, then use read_csv to get a dataframe.

from io import StringIO
import pandas as pd

file = open(f'{file_path}/{file_name}', 'r')
content = file.read()

# change new line character from '\r\n' to '\n'
lines = content.replace('\r', '').split('\n')

# Remove the first and last 2 lines of the file
# StringIO can be considered as a file stored in memory
df = pd.read_csv(StringIO("\n".join(lines[2:-2])), header=None)

回答 28

在我的情况下,分隔符不是默认的“,”,而是Tab。

pd.read_csv('file_name.csv', sep='\\t', lineterminator='\\r', engine='python', header='infer')

注意:某些资料建议的 "\t" 并不起作用,必须用 "\\t"。

In my case the separator was not the default “,” but Tab.

pd.read_csv('file_name.csv', sep='\\t', lineterminator='\\r', engine='python', header='infer')

Note: "\t" did not work as suggested by some sources; "\\t" was required.


回答 29

我有一个类似的错误,问题是我的csv文件中有一些转义的引号,并且需要适当地设置escapechar参数。

I had a similar error and the issue was that I had some escaped quotes in my csv file and needed to set the escapechar parameter appropriately.


熊猫:使用运算符链接过滤DataFrame的行

问题:熊猫:使用运算符链接过滤DataFrame的行

pandas 中的大部分操作都可以通过运算符链式调用完成(groupby、aggregate、apply 等),但我发现过滤行的唯一方法是普通的方括号索引:

df_filtered = df[df['column'] == value]

这不太吸引人,因为它要求我先把 df 赋给一个变量,然后才能根据它的值进行过滤。有没有更像下面这样的写法?

df_filtered = df.mask(lambda x: x['column'] == value)

Most operations in pandas can be accomplished with operator chaining (groupby, aggregate, apply, etc), but the only way I’ve found to filter rows is via normal bracket indexing

df_filtered = df[df['column'] == value]

This is unappealing as it requires I assign df to a variable before being able to filter on its values. Is there something more like the following?

df_filtered = df.mask(lambda x: x['column'] == value)

回答 0

我不确定您想要什么,您的最后一行代码也无济于事,但是无论如何:

通过“链接”布尔索引中的条件来完成“链式”过滤。

In [96]: df
Out[96]:
   A  B  C  D
a  1  4  9  1
b  4  5  0  2
c  5  5  1  0
d  1  3  9  6

In [99]: df[(df.A == 1) & (df.D == 6)]
Out[99]:
   A  B  C  D
d  1  3  9  6

如果要链接方法,可以添加自己的mask方法并使用该方法。

In [90]: def mask(df, key, value):
   ....:     return df[df[key] == value]
   ....:

In [92]: pandas.DataFrame.mask = mask

In [93]: df = pandas.DataFrame(np.random.randint(0, 10, (4,4)), index=list('abcd'), columns=list('ABCD'))

In [95]: df.ix['d','A'] = df.ix['a', 'A']

In [96]: df
Out[96]:
   A  B  C  D
a  1  4  9  1
b  4  5  0  2
c  5  5  1  0
d  1  3  9  6

In [97]: df.mask('A', 1)
Out[97]:
   A  B  C  D
a  1  4  9  1
d  1  3  9  6

In [98]: df.mask('A', 1).mask('D', 6)
Out[98]:
   A  B  C  D
d  1  3  9  6

I’m not entirely sure what you want, and your last line of code does not help either, but anyway:

“Chained” filtering is done by “chaining” the criteria in the boolean index.

In [96]: df
Out[96]:
   A  B  C  D
a  1  4  9  1
b  4  5  0  2
c  5  5  1  0
d  1  3  9  6

In [99]: df[(df.A == 1) & (df.D == 6)]
Out[99]:
   A  B  C  D
d  1  3  9  6

If you want to chain methods, you can add your own mask method and use that one.

In [90]: def mask(df, key, value):
   ....:     return df[df[key] == value]
   ....:

In [92]: pandas.DataFrame.mask = mask

In [93]: df = pandas.DataFrame(np.random.randint(0, 10, (4,4)), index=list('abcd'), columns=list('ABCD'))

In [95]: df.ix['d','A'] = df.ix['a', 'A']

In [96]: df
Out[96]:
   A  B  C  D
a  1  4  9  1
b  4  5  0  2
c  5  5  1  0
d  1  3  9  6

In [97]: df.mask('A', 1)
Out[97]:
   A  B  C  D
a  1  4  9  1
d  1  3  9  6

In [98]: df.mask('A', 1).mask('D', 6)
Out[98]:
   A  B  C  D
d  1  3  9  6

回答 1

可以使用Pandas 查询链接过滤器:

df = pd.DataFrame(np.random.randn(30, 3), columns=['a','b','c'])
df_filtered = df.query('a > 0').query('0 < b < 2')

过滤器也可以组合在一个查询中:

df_filtered = df.query('a > 0 and 0 < b < 2')

Filters can be chained using a Pandas query:

df = pd.DataFrame(np.random.randn(30, 3), columns=['a','b','c'])
df_filtered = df.query('a > 0').query('0 < b < 2')

Filters can also be combined in a single query:

df_filtered = df.query('a > 0 and 0 < b < 2')

回答 2

@lodagro 的答案很好。我会把 mask 函数泛化成如下形式来扩展它:

def mask(df, f):
  return df[f(df)]

然后,您可以执行以下操作:

df.mask(lambda x: x[0] < 0).mask(lambda x: x[1] > 0)

The answer from @lodagro is great. I would extend it by generalizing the mask function as:

def mask(df, f):
  return df[f(df)]

Then you can do stuff like:

df.mask(lambda x: x[0] < 0).mask(lambda x: x[1] > 0)

回答 3

0.18.1版开始,.loc方法接受可调用的选择。与lambda函数一起,您可以创建非常灵活的可链接过滤器:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0,100,size=(100, 4)), columns=list('ABCD'))
df.loc[lambda df: df.A == 80]  # equivalent to df[df.A == 80] but chainable

df.sort_values('A').loc[lambda df: df.A > 80].loc[lambda df: df.B > df.A]

如果您所做的只是过滤,也可以省略.loc

Since version 0.18.1 the .loc method accepts a callable for selection. Together with lambda functions you can create very flexible chainable filters:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0,100,size=(100, 4)), columns=list('ABCD'))
df.loc[lambda df: df.A == 80]  # equivalent to df[df.A == 80] but chainable

df.sort_values('A').loc[lambda df: df.A > 80].loc[lambda df: df.B > df.A]

If all you’re doing is filtering, you can also omit the .loc.
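
That is, plain bracket indexing accepts the same callable:

df[lambda df: df.A == 80]  # same result as the .loc version above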


回答 4

I offer this for additional examples. This is the same answer as https://stackoverflow.com/a/28159296/

I’ll add other edits to make this post more useful.

pandas.DataFrame.query
query was made for exactly this purpose. Consider the dataframe df

import pandas as pd
import numpy as np

np.random.seed([3,1415])
df = pd.DataFrame(
    np.random.randint(10, size=(10, 5)),
    columns=list('ABCDE')
)

df

   A  B  C  D  E
0  0  2  7  3  8
1  7  0  6  8  6
2  0  2  0  4  9
3  7  3  2  4  3
4  3  6  7  7  4
5  5  3  7  5  9
6  8  7  6  4  7
7  6  2  6  6  5
8  2  8  7  5  8
9  4  7  6  1  5

Let’s use query to filter all rows where D > B

df.query('D > B')

   A  B  C  D  E
0  0  2  7  3  8
1  7  0  6  8  6
2  0  2  0  4  9
3  7  3  2  4  3
4  3  6  7  7  4
5  5  3  7  5  9
7  6  2  6  6  5

Which we can chain:

df.query('D > B').query('C > B')
# equivalent to
# df.query('D > B and C > B')
# but defeats the purpose of demonstrating chaining

   A  B  C  D  E
0  0  2  7  3  8
1  7  0  6  8  6
4  3  6  7  7  4
5  5  3  7  5  9
7  6  2  6  6  5

Answer 5

I had the same question except that I wanted to combine the criteria into an OR condition. The format given by Wouter Overmeire combines the criteria into an AND condition such that both must be satisfied:

In [96]: df
Out[96]:
   A  B  C  D
a  1  4  9  1
b  4  5  0  2
c  5  5  1  0
d  1  3  9  6

In [99]: df[(df.A == 1) & (df.D == 6)]
Out[99]:
   A  B  C  D
d  1  3  9  6

But I found that, if you wrap each condition in (... == True) and join the criteria with a pipe, the criteria are combined in an OR condition, satisfied whenever either of them is true:

df[((df.A==1) == True) | ((df.D==6) == True)]
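
The (... == True) wrappers are redundant, though: the comparisons already yield boolean Series, so the pipe alone produces the OR behaviour:

df[(df.A == 1) | (df.D == 6)]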

Answer 6

pandas provides two alternatives to Wouter Overmeire’s answer which do not require any overriding. One is .loc[.] with a callable, as in

df_filtered = df.loc[lambda x: x['column'] == value]

the other is .pipe(), as in

df_filtered = df.pipe(lambda d: d[d['column'] == value])

(the lambda must do the subsetting itself: piping the bare comparison x['column'] == value would hand back a boolean Series, not a filtered frame).

Answer 7

My answer is similar to the others. If you do not want to create a new function you can use what pandas has defined for you already. Use the pipe method.

df.pipe(lambda d: d[d['column'] == value])
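
Pipes chain as well, so several filters can be stacked without intermediate variables (a sketch reusing the A/D columns from the earlier answers):

(df.pipe(lambda d: d[d['A'] == 1])
   .pipe(lambda d: d[d['D'] == 6]))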

Answer 8

If you would like to apply all of the common boolean masks as well as a general purpose mask you can chuck the following in a file and then simply assign them all as follows:

pd.DataFrame = apply_masks()  # apply_masks() attaches the helpers and returns pd.DataFrame

Usage:

import numpy as np
import pandas as pd

A = pd.DataFrame(np.random.randn(4, 4), columns=["A", "B", "C", "D"])
A.le_mask("A", 0.7).ge_mask("B", 0.2)  # may be chained and repeated as necessary

It’s a little bit hacky but it can make things a little bit cleaner if you’re continuously chopping and changing datasets according to filters. There’s also a general purpose filter adapted from Daniel Velkov above in the gen_mask function which you can use with lambda functions or otherwise if desired.

File to be saved (I use masks.py):

import pandas as pd

def eq_mask(df, key, value):
    return df[df[key] == value]

def ge_mask(df, key, value):
    return df[df[key] >= value]

def gt_mask(df, key, value):
    return df[df[key] > value]

def le_mask(df, key, value):
    return df[df[key] <= value]

def lt_mask(df, key, value):
    return df[df[key] < value]

def ne_mask(df, key, value):
    return df[df[key] != value]

def gen_mask(df, f):
    return df[f(df)]

def apply_masks():

    pd.DataFrame.eq_mask = eq_mask
    pd.DataFrame.ge_mask = ge_mask
    pd.DataFrame.gt_mask = gt_mask
    pd.DataFrame.le_mask = le_mask
    pd.DataFrame.lt_mask = lt_mask
    pd.DataFrame.ne_mask = ne_mask
    pd.DataFrame.gen_mask = gen_mask

    return pd.DataFrame

if __name__ == '__main__':
    pass

Answer 9

This solution is more hackish in terms of implementation, but I find it much cleaner in terms of usage, and it is certainly more general than the others proposed.

https://github.com/toobaz/generic_utils/blob/master/generic_utils/pandas/where.py

You don’t need to download the entire repo: saving the file and doing

from where import where as W

should suffice. Then you use it like this:

df = pd.DataFrame([[1, 2, True],
                   [3, 4, False], 
                   [5, 7, True]],
                  index=range(3), columns=['a', 'b', 'c'])
# On specific column:
print(df.loc[W['a'] > 2])
print(df.loc[-W['a'] == W['b']])
print(df.loc[~W['c']])
# On entire - or subset of a - DataFrame:
print(df.loc[W.sum(axis=1) > 3])
print(df.loc[W[['a', 'b']].diff(axis=1)['b'] > 1])

A slightly less stupid usage example:

data = pd.read_csv('ugly_db.csv').loc[~(W == '$null$').any(axis=1)]

By the way: even in the case in which you are just using boolean cols,

df.loc[W['cond1']].loc[W['cond2']]

can be much more efficient than

df.loc[W['cond1'] & W['cond2']]

because it evaluates cond2 only where cond1 is True.

DISCLAIMER: I first gave this answer elsewhere because I hadn’t seen this.


Answer 10

Just want to add a demonstration of using loc to filter not only by rows but also by columns, plus a few merits of the chained operation.

The code below can filter the rows by value.

df_filtered = df.loc[df['column'] == value]

By modifying it a bit you can filter the columns as well.

df_filtered = df.loc[df['column'] == value, ['year', 'column']]

So why do we want a chained method? The answer is that it is easier to read when you have many operations. For example,

res = df\
    .loc[df['station'] == 'USA', ['year', 'TEMP', 'RF']]\
    .groupby('year')\
    .agg(np.nanmean)   # note 'year' must be kept in the selection so groupby can find it
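
A side note: newer versions of pandas warn about passing NumPy functions such as np.nanmean to agg, and the built-in string aggregations already skip NaN. An equivalent modern spelling (a sketch, same assumed columns) would be:

res = (df
       .loc[df['station'] == 'USA', ['year', 'TEMP', 'RF']]
       .groupby('year')
       .agg('mean'))   # groupby 'mean' skips NaN by default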

Answer 11

This is unappealing as it requires I assign df to a variable before being able to filter on its values.

df[df["column_name"] != 5].groupby("other_column_name")

seems to work: you can nest the [] operator as well. Maybe they added it since you asked the question.


Answer 12

If you set the columns you want to search on as the index, then you can use DataFrame.xs() to take a cross section. This is not as versatile as the query answers, but it might be useful in some situations.

import pandas as pd
import numpy as np

np.random.seed([3,1415])
df = pd.DataFrame(
    np.random.randint(3, size=(10, 5)),
    columns=list('ABCDE')
)

df
# Out[55]: 
#    A  B  C  D  E
# 0  0  2  2  2  2
# 1  1  1  2  0  2
# 2  0  2  0  0  2
# 3  0  2  2  0  1
# 4  0  1  1  2  0
# 5  0  0  0  1  2
# 6  1  0  1  1  1
# 7  0  0  2  0  2
# 8  2  2  2  2  2
# 9  1  2  0  2  1

df.set_index(['A', 'D']).xs([0, 2]).reset_index()
# Out[57]: 
#    A  D  B  C  E
# 0  0  2  2  2  2
# 1  0  2  1  1  0
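
If you would rather keep A and D visible without the reset_index round trip, xs accepts drop_level=False (a small variation on the same data):

df.set_index(['A', 'D']).xs((0, 2), drop_level=False)
#      B  C  E
# A D
# 0 2  2  2  2
#   2  1  1  0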

Answer 13

You can also leverage the numpy library for logical operations. It's pretty fast.

df[np.logical_and(df['A'] == 1, df['B'] == 6)]
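
For more than two conditions, the reduce form of the ufunc avoids nested calls (assuming a third column C just for illustration):

df[np.logical_and.reduce([df['A'] == 1, df['B'] == 6, df['C'] > 2])]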

pandas - How to flatten a hierarchical index in columns

Question: pandas - How to flatten a hierarchical index in columns

I have a data frame with a hierarchical index in axis 1 (columns) (from a groupby.agg operation):

     USAF   WBAN  year  month  day  s_PC  s_CL  s_CD  s_CNT  tempf       
                                     sum   sum   sum    sum   amax   amin
0  702730  26451  1993      1    1     1     0    12     13  30.92  24.98
1  702730  26451  1993      1    2     0     0    13     13  32.00  24.98
2  702730  26451  1993      1    3     1    10     2     13  23.00   6.98
3  702730  26451  1993      1    4     1     0    12     13  10.04   3.92
4  702730  26451  1993      1    5     3     0    10     13  19.94  10.94

I want to flatten it, so that it looks like this (names aren’t critical – I could rename):

     USAF   WBAN  year  month  day  s_PC  s_CL  s_CD  s_CNT  tempf_amax  tempf_amin   
0  702730  26451  1993      1    1     1     0    12     13  30.92          24.98
1  702730  26451  1993      1    2     0     0    13     13  32.00          24.98
2  702730  26451  1993      1    3     1    10     2     13  23.00          6.98
3  702730  26451  1993      1    4     1     0    12     13  10.04          3.92
4  702730  26451  1993      1    5     3     0    10     13  19.94          10.94

How do I do this? (I’ve tried a lot, to no avail.)

Per a suggestion, here is the head in dict form

{('USAF', ''): {0: '702730',
  1: '702730',
  2: '702730',
  3: '702730',
  4: '702730'},
 ('WBAN', ''): {0: '26451', 1: '26451', 2: '26451', 3: '26451', 4: '26451'},
 ('day', ''): {0: 1, 1: 2, 2: 3, 3: 4, 4: 5},
 ('month', ''): {0: 1, 1: 1, 2: 1, 3: 1, 4: 1},
 ('s_CD', 'sum'): {0: 12.0, 1: 13.0, 2: 2.0, 3: 12.0, 4: 10.0},
 ('s_CL', 'sum'): {0: 0.0, 1: 0.0, 2: 10.0, 3: 0.0, 4: 0.0},
 ('s_CNT', 'sum'): {0: 13.0, 1: 13.0, 2: 13.0, 3: 13.0, 4: 13.0},
 ('s_PC', 'sum'): {0: 1.0, 1: 0.0, 2: 1.0, 3: 1.0, 4: 3.0},
 ('tempf', 'amax'): {0: 30.920000000000002,
  1: 32.0,
  2: 23.0,
  3: 10.039999999999999,
  4: 19.939999999999998},
 ('tempf', 'amin'): {0: 24.98,
  1: 24.98,
  2: 6.9799999999999969,
  3: 3.9199999999999982,
  4: 10.940000000000001},
 ('year', ''): {0: 1993, 1: 1993, 2: 1993, 3: 1993, 4: 1993}}

Answer 0

I think the easiest way to do this would be to set the columns to the top level:

df.columns = df.columns.get_level_values(0)

Note: if the top level has a name, you can also pass that name instead of 0.

If you want to combine/join your MultiIndex into one Index (assuming you have just string entries in your columns) you could:

df.columns = [' '.join(col).strip() for col in df.columns.values]

Note: strip removes the trailing whitespace left behind when a column has no second-level label.

In [11]: [' '.join(col).strip() for col in df.columns.values]
Out[11]: 
['USAF',
 'WBAN',
 'day',
 'month',
 's_CD sum',
 's_CL sum',
 's_CNT sum',
 's_PC sum',
 'tempf amax',
 'tempf amin',
 'year']

Answer 1

pd.DataFrame(df.to_records())   # the MultiIndex becomes columns and the new index is plain integers

Answer 2

All of the earlier answers on this thread are now a bit dated. As of pandas version 0.24.0, .to_flat_index() does what you need.

From pandas' own documentation:

MultiIndex.to_flat_index()

Convert a MultiIndex to an Index of Tuples containing the level values.

A simple example from its documentation:

import pandas as pd
print(pd.__version__)  # needs pandas 0.24.0 or later for to_flat_index()
index = pd.MultiIndex.from_product(
        [['foo', 'bar'], ['baz', 'qux']],
        names=['a', 'b'])

print(index)
# MultiIndex(levels=[['bar', 'foo'], ['baz', 'qux']],
#           codes=[[1, 1, 0, 0], [0, 1, 0, 1]],
#           names=['a', 'b'])

Applying to_flat_index():

index.to_flat_index()
# Index([('foo', 'baz'), ('foo', 'qux'), ('bar', 'baz'), ('bar', 'qux')], dtype='object')

Using it to replace existing pandas column

An example of how you’d use it on dat, which is a DataFrame with a MultiIndex column:

dat = df.loc[:,['name','workshop_period','class_size']].groupby(['name','workshop_period']).describe()
print(dat.columns)
# MultiIndex(levels=[['class_size'], ['count', 'mean', 'std', 'min', '25%', '50%', '75%', 'max']],
#            codes=[[0, 0, 0, 0, 0, 0, 0, 0], [0, 1, 2, 3, 4, 5, 6, 7]])

dat.columns = dat.columns.to_flat_index()
print(dat.columns)
# Index([('class_size', 'count'),  ('class_size', 'mean'),
#     ('class_size', 'std'),   ('class_size', 'min'),
#     ('class_size', '25%'),   ('class_size', '50%'),
#     ('class_size', '75%'),   ('class_size', 'max')],
#  dtype='object')
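
Note that to_flat_index() leaves tuple labels in place; if you want plain strings you can join them in a follow-up step:

dat.columns = ['_'.join(col) for col in dat.columns]
# Index(['class_size_count', 'class_size_mean', 'class_size_std', ...],
#       dtype='object')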

Answer 3

Andy Hayden's answer is certainly the easiest way; if you want to avoid duplicate column labels you need to tweak a bit:

In [34]: df
Out[34]: 
     USAF   WBAN  day  month  s_CD  s_CL  s_CNT  s_PC  tempf         year
                               sum   sum    sum   sum   amax   amin      
0  702730  26451    1      1    12     0     13     1  30.92  24.98  1993
1  702730  26451    2      1    13     0     13     0  32.00  24.98  1993
2  702730  26451    3      1     2    10     13     1  23.00   6.98  1993
3  702730  26451    4      1    12     0     13     1  10.04   3.92  1993
4  702730  26451    5      1    10     0     13     3  19.94  10.94  1993


In [35]: mi = df.columns

In [36]: mi
Out[36]: 
MultiIndex
[(USAF, ), (WBAN, ), (day, ), (month, ), (s_CD, sum), (s_CL, sum), (s_CNT, sum), (s_PC, sum), (tempf, amax), (tempf, amin), (year, )]


In [37]: mi.tolist()
Out[37]: 
[('USAF', ''),
 ('WBAN', ''),
 ('day', ''),
 ('month', ''),
 ('s_CD', 'sum'),
 ('s_CL', 'sum'),
 ('s_CNT', 'sum'),
 ('s_PC', 'sum'),
 ('tempf', 'amax'),
 ('tempf', 'amin'),
 ('year', '')]

In [38]: ind = pd.Index([e[0] + e[1] for e in mi.tolist()])

In [39]: ind
Out[39]: Index([USAF, WBAN, day, month, s_CDsum, s_CLsum, s_CNTsum, s_PCsum, tempfamax, tempfamin, year], dtype=object)

In [40]: df.columns = ind




In [46]: df
Out[46]: 
     USAF   WBAN  day  month  s_CDsum  s_CLsum  s_CNTsum  s_PCsum  tempfamax  tempfamin  \
0  702730  26451    1      1       12        0        13        1      30.92      24.98   
1  702730  26451    2      1       13        0        13        0      32.00      24.98   
2  702730  26451    3      1        2       10        13        1      23.00       6.98   
3  702730  26451    4      1       12        0        13        1      10.04       3.92   
4  702730  26451    5      1       10        0        13        3      19.94      10.94   




   year  
0  1993  
1  1993  
2  1993  
3  1993  
4  1993

Answer 4

df.columns = ['_'.join(tup).rstrip('_') for tup in df.columns.values]

Answer 5

And if you want to retain any of the aggregation info from the second level of the multiindex you can try this:

In [1]: new_cols = [''.join(t) for t in df.columns]
Out[1]:
['USAF',
 'WBAN',
 'day',
 'month',
 's_CDsum',
 's_CLsum',
 's_CNTsum',
 's_PCsum',
 'tempfamax',
 'tempfamin',
 'year']

In [2]: df.columns = new_cols

Answer 6

The most Pythonic way to do this is to use the map function.

df.columns = df.columns.map(' '.join).str.strip()

Output of print(df.columns):

Index(['USAF', 'WBAN', 'day', 'month', 's_CD sum', 's_CL sum', 's_CNT sum',
       's_PC sum', 'tempf amax', 'tempf amin', 'year'],
      dtype='object')

Update using Python 3.6+ with f string:

df.columns = [f'{f} {s}' if s != '' else f'{f}' 
              for f, s in df.columns]

print(df.columns)

Output:

Index(['USAF', 'WBAN', 'day', 'month', 's_CD sum', 's_CL sum', 's_CNT sum',
       's_PC sum', 'tempf amax', 'tempf amin', 'year'],
      dtype='object')

Answer 7

The easiest and most intuitive solution for me was to combine the column names using get_level_values. This prevents duplicate column names when you do more than one aggregation on the same column:

level_one = df.columns.get_level_values(0).astype(str)
level_two = df.columns.get_level_values(1).astype(str)
df.columns = level_one + level_two

If you want a separator between columns, you can do this. It returns the same thing as Seiji Armstrong's comment on the accepted answer: underscores are added only for columns that have values in both index levels:

level_one = df.columns.get_level_values(0).astype(str)
level_two = df.columns.get_level_values(1).astype(str)
column_separator = ['_' if x != '' else '' for x in level_two]
df.columns = level_one + column_separator + level_two

I know this does the same thing as Andy Hayden’s great answer above, but I think it is a bit more intuitive this way and is easier to remember (so I don’t have to keep referring to this thread), especially for novice pandas users.

This method also extends naturally to the case where you have 3 column levels.

level_one = df.columns.get_level_values(0).astype(str)
level_two = df.columns.get_level_values(1).astype(str)
level_three = df.columns.get_level_values(2).astype(str)
df.columns = level_one + level_two + level_three

Answer 8

After reading through all the answers, I came up with this:

def __my_flatten_cols(self, how="_".join, reset_index=True):
    how = (lambda iter: list(iter)[-1]) if how == "last" else how
    self.columns = [how(filter(None, map(str, levels))) for levels in self.columns.values] \
                    if isinstance(self.columns, pd.MultiIndex) else self.columns
    return self.reset_index() if reset_index else self
pd.DataFrame.my_flatten_cols = __my_flatten_cols

Usage:

Given a data frame:

df = pd.DataFrame({"grouper": ["x","x","y","y"], "val1": [0,2,4,6], 2: [1,3,5,7]}, columns=["grouper", "val1", 2])

  grouper  val1  2
0       x     0  1
1       x     2  3
2       y     4  5
3       y     6  7
  • Single aggregation method: resulting variables named the same as source:

    df.groupby(by="grouper").agg("min").my_flatten_cols()
    
    • Same as df.groupby(by="grouper", as_index=False).agg(...), or as tacking .reset_index() onto the grouped result
    • ----- before -----
                 val1  2
        grouper         
      
      ------ after -----
        grouper  val1  2
      0       x     0  1
      1       y     4  5
      
  • Single source variable, multiple aggregations: resulting variables named after statistics:

    df.groupby(by="grouper").agg({"val1": [min,max]}).my_flatten_cols("last")
    
    • Same as a = df.groupby(..).agg(..); a.columns = a.columns.droplevel(0); a.reset_index().
    • ----- before -----
                  val1    
                 min max
        grouper         
      
      ------ after -----
        grouper  min  max
      0       x    0    2
      1       y    4    6
      
  • Multiple variables, multiple aggregations: resulting variables named (varname)_(statname):

    df.groupby(by="grouper").agg({"val1": min, 2:[sum, "size"]}).my_flatten_cols()
    # you can combine the names in other ways too, e.g. use a different delimiter:
    #df.groupby(by="grouper").agg({"val1": min, 2:[sum, "size"]}).my_flatten_cols(" ".join)
    
    • Runs a.columns = ["_".join(filter(None, map(str, levels))) for levels in a.columns.values] under the hood (since this form of agg() results in MultiIndex on columns).
    • If you don’t have the my_flatten_cols helper, it might be easier to type in the solution suggested by @Seigi: a.columns = ["_".join(t).rstrip("_") for t in a.columns.values], which works similarly in this case (but fails if you have numeric labels on columns)
    • To handle numeric labels on columns, you could use the solution suggested by @jxstanford and @Nolan Conaway (a.columns = ["_".join(tuple(map(str, t))).rstrip("_") for t in a.columns.values]), but I don't understand why the tuple() call is needed, and I believe rstrip() is only required if some columns have an empty second level like ("colname", "") (which can happen if you reset_index() before fixing up .columns)
    • ----- before -----
                 val1           2     
                 min       sum    size
        grouper              
      
      ------ after -----
        grouper  val1_min  2_sum  2_size
      0       x         0      4       2
      1       y         4     12       2
      
  • You want to name the resulting variables manually: (this is deprecated since pandas 0.20.0 with no adequate alternative as of 0.23)

    df.groupby(by="grouper").agg({"val1": {"sum_of_val1": "sum", "count_of_val1": "count"},
                                       2: {"sum_of_2":    "sum", "count_of_2":    "count"}}).my_flatten_cols("last")
    
    • Other suggestions include: setting the columns manually: res.columns = ['A_sum', 'B_sum', 'count'] or .join()ing multiple groupby statements.
    • ----- before -----
                         val1                      2         
                count_of_val1 sum_of_val1 count_of_2 sum_of_2
        grouper                                              
      
      ------ after -----
        grouper  count_of_val1  sum_of_val1  count_of_2  sum_of_2
      0       x              2            2           2         4
      1       y              2           10           2        12
      

Cases handled by the helper function

  • level names can be non-strings, e.g. when a DataFrame has integer column labels, so we have to convert with map(str, ..)
  • they can also be empty, so we have to filter(None, ..)
  • for single-level columns (i.e. anything except MultiIndex), columns.values returns the names (str, not tuples)
  • depending on how you used .agg() you may need to keep the bottom-most label for a column or concatenate multiple labels
  • (since I’m new to pandas?) more often than not, I want reset_index() to be able to work with the group-by columns in the regular way, so it does that by default

Answer 9

A general solution that handles multiple levels and mixed types:

df.columns = ['_'.join(tuple(map(str, t))) for t in df.columns.values]

Answer 10

A bit late maybe, but if you are not worried about duplicate column names:

df.columns = df.columns.tolist()

Answer 11

In case you want to have a separator in the name between levels, this function works well.

def flattenHierarchicalCol(col, sep='_'):
    if not isinstance(col, tuple):
        return col
    # join the non-empty levels with the separator
    # (also avoids a stray leading separator when the first level is empty)
    return sep.join(level for level in col if level != '')

df.columns = df.columns.map(flattenHierarchicalCol)

Answer 12

Following @jxstanford and @tvt173, I wrote a quick function which should do the trick, regardless of string/int column names:

def flatten_cols(df):
    df.columns = [
        '_'.join(tuple(map(str, t))).rstrip('_')
        for t in df.columns.values
    ]
    return df

Answer 13

You could also do as below. Consider df to be your dataframe and assume a two-level index (as is the case in your example):

df.columns = [df.columns[i][0] + '_' + df.columns[i][1] for i in range(len(df.columns))]

Answer 14

I’ll share a straight-forward way that worked for me.

[" ".join([str(elem) for elem in tup]) for tup in df.columns.tolist()]
#df = df.reset_index() if needed

Answer 15

To flatten a MultiIndex inside a chain of other DataFrame methods, define a function like this:

def flatten_index(df):
  df_copy = df.copy()
  df_copy.columns = ['_'.join(col).rstrip('_') for col in df_copy.columns.values]
  return df_copy.reset_index()

Then use the pipe method to apply this function in the chain of DataFrame methods, after groupby and agg but before any other methods in the chain:

my_df \
  .groupby('group') \
  .agg({'value': ['count']}) \
  .pipe(flatten_index) \
  .sort_values('value_count')

Answer 16

Another simple routine.

def flatten_columns(df, sep='.'):
    def _remove_empty(column_name):
        return tuple(element for element in column_name if element)
    def _join(column_name):
        return sep.join(column_name)

    new_columns = [_join(_remove_empty(column)) for column in df.columns.values]
    df.columns = new_columns
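
A quick usage sketch (with a made-up frame, since the routine mutates df in place and returns nothing):

import pandas as pd

raw = pd.DataFrame({'g': ['x', 'x', 'y'], 'v': [1, 2, 3]})
agg = raw.groupby('g').agg({'v': ['min', 'max']})

flatten_columns(agg)
print(agg.columns)   # Index(['v.min', 'v.max'], dtype='object')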

How are iloc, ix and loc different?

Question: How are iloc, ix and loc different?

Can someone explain how these three methods of slicing are different?
I’ve seen the docs, and I’ve seen these answers, but I still find myself unable to explain how the three are different. To me, they seem interchangeable in large part, because they are at the lower levels of slicing.

For example, say we want to get the first five rows of a DataFrame. How is it that all three of these work?

df.loc[:5]
df.ix[:5]
df.iloc[:5]

Can someone present three cases where the distinction in use is clearer?


Answer 0

Note: in pandas version 0.20.0 and above, ix is deprecated and the use of loc and iloc is encouraged instead. I have left the parts of this answer that describe ix intact as a reference for users of earlier versions of pandas. Examples have been added below showing alternatives to ix.


First, here’s a recap of the three methods:

  • loc gets rows (or columns) with particular labels from the index.
  • iloc gets rows (or columns) at particular positions in the index (so it only takes integers).
  • ix usually tries to behave like loc but falls back to behaving like iloc if a label is not present in the index.

It’s important to note some subtleties that can make ix slightly tricky to use:

  • if the index is of integer type, ix will only use label-based indexing and not fall back to position-based indexing. If the label is not in the index, an error is raised.

  • if the index does not contain only integers, then given an integer, ix will immediately use position-based indexing rather than label-based indexing. If however ix is given another type (e.g. a string), it can use label-based indexing.


To illustrate the differences between the three methods, consider the following Series:

>>> s = pd.Series(np.nan, index=[49,48,47,46,45, 1, 2, 3, 4, 5])
>>> s
49   NaN
48   NaN
47   NaN
46   NaN
45   NaN
1    NaN
2    NaN
3    NaN
4    NaN
5    NaN

We’ll look at slicing with the integer value 3.

In this case, s.iloc[:3] returns us the first 3 rows (since it treats 3 as a position) and s.loc[:3] returns us the first 8 rows (since it treats 3 as a label):

>>> s.iloc[:3] # slice the first three rows
49   NaN
48   NaN
47   NaN

>>> s.loc[:3] # slice up to and including label 3
49   NaN
48   NaN
47   NaN
46   NaN
45   NaN
1    NaN
2    NaN
3    NaN

>>> s.ix[:3] # the integer is in the index so s.ix[:3] works like loc
49   NaN
48   NaN
47   NaN
46   NaN
45   NaN
1    NaN
2    NaN
3    NaN

Notice s.ix[:3] returns the same Series as s.loc[:3] since it looks for the label first rather than working on the position (and the index for s is of integer type).

What if we try with an integer label that isn’t in the index (say 6)?

Here s.iloc[:6] returns the first 6 rows of the Series as expected. However, s.loc[:6] raises a KeyError since 6 is not in the index.

>>> s.iloc[:6]
49   NaN
48   NaN
47   NaN
46   NaN
45   NaN
1    NaN

>>> s.loc[:6]
KeyError: 6

>>> s.ix[:6]
KeyError: 6

As per the subtleties noted above, s.ix[:6] now raises a KeyError because it tries to work like loc but can’t find a 6 in the index. Because our index is of integer type ix doesn’t fall back to behaving like iloc.

If, however, our index was of mixed type, given an integer ix would behave like iloc immediately instead of raising a KeyError:

>>> s2 = pd.Series(np.nan, index=['a','b','c','d','e', 1, 2, 3, 4, 5])
>>> s2.index.is_mixed() # index is mix of different types
True
>>> s2.ix[:6] # now behaves like iloc given integer
a   NaN
b   NaN
c   NaN
d   NaN
e   NaN
1   NaN

Keep in mind that ix can still accept non-integers and behave like loc:

>>> s2.ix[:'c'] # behaves like loc given non-integer
a   NaN
b   NaN
c   NaN

As general advice, if you’re only indexing using labels, or only indexing using integer positions, stick with loc or iloc to avoid unexpected results – try not use ix.


Combining position-based and label-based indexing

Sometimes given a DataFrame, you will want to mix label and positional indexing methods for the rows and columns.

For example, consider the following DataFrame. How best to slice the rows up to and including ‘c’ and take the first four columns?

>>> df = pd.DataFrame(np.nan, 
                      index=list('abcde'),
                      columns=['x','y','z', 8, 9])
>>> df
    x   y   z   8   9
a NaN NaN NaN NaN NaN
b NaN NaN NaN NaN NaN
c NaN NaN NaN NaN NaN
d NaN NaN NaN NaN NaN
e NaN NaN NaN NaN NaN

In earlier versions of pandas (before 0.20.0) ix let you do this quite neatly – you could slice the rows by label and the columns by position (note that for the columns, ix defaults to position-based slicing since 4 is not a column name):

>>> df.ix[:'c', :4]
    x   y   z   8
a NaN NaN NaN NaN
b NaN NaN NaN NaN
c NaN NaN NaN NaN

In later versions of pandas, we can achieve this result using iloc and the help of another method:

>>> df.iloc[:df.index.get_loc('c') + 1, :4]
    x   y   z   8
a NaN NaN NaN NaN
b NaN NaN NaN NaN
c NaN NaN NaN NaN

get_loc() is an index method meaning “get the position of the label in this index”. Note that since slicing with iloc is exclusive of its endpoint, we must add 1 to this value if we want row ‘c’ as well.

There are further examples in pandas’ documentation here.
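
Alternatively, you can stay entirely label-based by converting the positional half instead:

>>> df.loc[:'c', df.columns[:4]]
    x   y   z   8
a NaN NaN NaN NaN
b NaN NaN NaN NaN
c NaN NaN NaN NaN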


Answer 1

iloc works based on integer positioning. So no matter what your row labels are, you can always, e.g., get the first row by doing

df.iloc[0]

or the last five rows by doing

df.iloc[-5:]

You can also use it on the columns. This retrieves the 3rd column:

df.iloc[:, 2]    # the : in the first position indicates all rows

You can combine them to get intersections of rows and columns:

df.iloc[:3, :3] # The upper-left 3 X 3 entries (assuming df has 3+ rows and columns)

On the other hand, .loc uses named indices. Let's set up a data frame with strings as row and column labels:

df = pd.DataFrame(index=['a', 'b', 'c'], columns=['time', 'date', 'name'])

Then we can get the first row by

df.loc['a']     # equivalent to df.iloc[0]

and the second two rows of the 'date' column by

df.loc['b':, 'date']   # equivalent to df.iloc[1:, 1]

and so on. Now, it’s probably worth pointing out that the default row and column indices for a DataFrame are integers from 0 and in this case iloc and loc would work in the same way. This is why your three examples are equivalent. If you had a non-numeric index such as strings or datetimes, df.loc[:5] would raise an error.

Also, you can do column retrieval just by using the data frame’s __getitem__:

df['time']    # equivalent to df.loc[:, 'time']

Now suppose you want to mix position and named indexing, that is, indexing using names on rows and positions on columns (to clarify, I mean select from our data frame, rather than creating a data frame with strings in the row index and integers in the column index). This is where .ix comes in:

df.ix[:2, 'time']    # the first two rows of the 'time' column

I think it’s also worth mentioning that you can pass boolean vectors to the loc method as well. For example:

 b = [True, False, True]
 df.loc[b] 

Will return the 1st and 3rd rows of df. This is equivalent to df[b] for selection, but it can also be used for assigning via boolean vectors:

df.loc[b, 'name'] = 'Mary', 'John'

Answer 2

In my opinion, the accepted answer is confusing, since it uses a DataFrame with only missing values. I also do not like the term position-based for .iloc and instead prefer integer location, as it is much more descriptive and exactly what .iloc stands for. The key word is INTEGER – .iloc needs INTEGERS.

See my extremely detailed blog series on subset selection for more


.ix is deprecated and ambiguous and should never be used

Because .ix is deprecated we will only focus on the differences between .loc and .iloc.

Before we talk about the differences, it is important to understand that DataFrames have labels that help identify each column and each index. Let’s take a look at a sample DataFrame:

df = pd.DataFrame({'age':[30, 2, 12, 4, 32, 33, 69],
                   'color':['blue', 'green', 'red', 'white', 'gray', 'black', 'red'],
                   'food':['Steak', 'Lamb', 'Mango', 'Apple', 'Cheese', 'Melon', 'Beans'],
                   'height':[165, 70, 120, 80, 180, 172, 150],
                   'score':[4.6, 8.3, 9.0, 3.3, 1.8, 9.5, 2.2],
                   'state':['NY', 'TX', 'FL', 'AL', 'AK', 'TX', 'TX']
                   },
                  index=['Jane', 'Nick', 'Aaron', 'Penelope', 'Dean', 'Christina', 'Cornelia'])

[image: the sample DataFrame printed as a table]

All the words in bold are the labels. The labels, age, color, food, height, score and state are used for the columns. The other labels, Jane, Nick, Aaron, Penelope, Dean, Christina, Cornelia are used for the index.


The primary ways to select particular rows in a DataFrame are with the .loc and .iloc indexers. Each of these indexers can also be used to simultaneously select columns but it is easier to just focus on rows for now. Also, each of the indexers use a set of brackets that immediately follow their name to make their selections.

.loc selects data only by labels

We will first talk about the .loc indexer which only selects data by the index or column labels. In our sample DataFrame, we have provided meaningful names as values for the index. Many DataFrames will not have any meaningful names and will instead, default to just the integers from 0 to n-1, where n is the length of the DataFrame.

There are three different inputs you can use for .loc

  • A string
  • A list of strings
  • Slice notation using strings as the start and stop values

Selecting a single row with .loc with a string

To select a single row of data, place the index label inside of the brackets following .loc.

df.loc['Penelope']

This returns the row of data as a Series

age           4
color     white
food      Apple
height       80
score       3.3
state        AL
Name: Penelope, dtype: object

Selecting multiple rows with .loc with a list of strings

df.loc[['Cornelia', 'Jane', 'Dean']]

This returns a DataFrame with the rows in the order specified in the list:


Selecting multiple rows with .loc with slice notation

Slice notation is defined by start, stop and step values. When slicing by label, pandas includes the stop value in the return. The following slices from Aaron to Dean, inclusive. The step size is not explicitly defined here, so it defaults to 1.

df.loc['Aaron':'Dean']


Complex slices can be taken in the same manner as Python lists.
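
For instance, a step can be added to a label slice. On the sample DataFrame above, the following should return every other row from Nick through Cornelia (Nick, Penelope and Christina), with the stop label still treated inclusively:

df.loc['Nick':'Cornelia':2]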

.iloc selects data only by integer location

Let’s now turn to .iloc. Every row and column of data in a DataFrame has an integer location that defines it. This is in addition to the label that is visually displayed in the output. The integer location is simply the number of rows/columns from the top/left beginning at 0.

There are three different inputs you can use for .iloc

  • An integer
  • A list of integers
  • Slice notation using integers as the start and stop values

Selecting a single row with .iloc with an integer

df.iloc[4]

This returns the 5th row (integer location 4) as a Series

age           32
color       gray
food      Cheese
height       180
score        1.8
state         AK
Name: Dean, dtype: object

Selecting multiple rows with .iloc with a list of integers

df.iloc[[2, -2]]

This returns a DataFrame of the third and second to last rows:


Selecting multiple rows with .iloc with slice notation

df.iloc[:5:3]



Simultaneous selection of rows and columns with .loc and .iloc

One excellent feature of both .loc/.iloc is their ability to select rows and columns simultaneously. In the examples above, all the columns were returned from each selection. We can choose columns with the same types of inputs as we do for rows. We simply need to separate the row and column selection with a comma.

For example, we can select rows Jane, and Dean with just the columns height, score and state like this:

df.loc[['Jane', 'Dean'], 'height':]


This uses a list of labels for the rows and slice notation for the columns

We can naturally do similar operations with .iloc using only integers.

df.iloc[[1,4], 2]
Nick      Lamb
Dean    Cheese
Name: food, dtype: object

Simultaneous selection with labels and integer location

.ix was used to make selections simultaneously with labels and integer location, which was useful but confusing and at times ambiguous; thankfully, it has been deprecated. In the event that you need to make a selection with a mix of labels and integer locations, you will have to make both your selections labels or integer locations.

For instance, if we want to select rows Nick and Cornelia along with columns 2 and 4, we could use .loc by converting the integers to labels with the following:

col_names = df.columns[[2, 4]]
df.loc[['Nick', 'Cornelia'], col_names] 

Or alternatively, convert the index labels to integers with the get_loc index method.

labels = ['Nick', 'Cornelia']
index_ints = [df.index.get_loc(label) for label in labels]
df.iloc[index_ints, [2, 4]]

Boolean Selection

The .loc indexer can also do boolean selection. For instance, if we are interested in finding all the rows where age is above 30 and returning just the food and score columns, we can do the following:

df.loc[df['age'] > 30, ['food', 'score']] 

You can replicate this with .iloc but you cannot pass it a boolean series. You must convert the boolean Series into a numpy array like this:

df.iloc[(df['age'] > 30).values, [2, 4]] 

Selecting all rows

It is possible to use .loc/.iloc for just column selection. You can select all the rows by using a colon like this:

df.loc[:, 'color':'score':2]

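The .iloc equivalent selects the same columns by position; note that the stop has to be one past score (position 4), because integer slices are exclusive while label slices are inclusive:

df.iloc[:, 1:5:2]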


The indexing operator, [], can select rows and columns too but not simultaneously.

Most people are familiar with the primary purpose of the DataFrame indexing operator, which is to select columns. A string selects a single column as a Series and a list of strings selects multiple columns as a DataFrame.

df['food']

Jane          Steak
Nick           Lamb
Aaron         Mango
Penelope      Apple
Dean         Cheese
Christina     Melon
Cornelia      Beans
Name: food, dtype: object

Using a list selects multiple columns

df[['food', 'score']]


What people are less familiar with is that, when slice notation is used, selection happens by row labels or by integer location. This is very confusing and something that I almost never use, but it does work.

df['Penelope':'Christina'] # slice rows by label


df[2:6:2] # slice rows by integer location


The explicitness of .loc/.iloc for selecting rows is highly preferred. The indexing operator alone is unable to select rows and columns simultaneously.

df[3:5, 'color']
TypeError: unhashable type: 'slice'
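
If you do need rows 3-4 together with the color column, a couple of equivalent single-indexer sketches against the sample df:

df.iloc[3:5, 1]                 # rows 3 and 4, column 'color' at integer location 1
df.loc[df.index[3:5], 'color']  # same rows converted to labels, column by label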

如何处理熊猫中的SettingWithCopyWarning?

问题:如何处理熊猫中的SettingWithCopyWarning?

背景

我刚刚将熊猫从0.11升级到0.13.0rc1。现在,该应用程序弹出了许多新警告。其中之一是这样的:

E:\FinReporter\FM_EXT.py:449: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_index,col_indexer] = value instead
  quote_df['TVol']   = quote_df['TVol']/TVOL_SCALE

我想知道到底是什么意思?我需要改变什么吗?

如果我坚持使用quote_df['TVol'] = quote_df['TVol']/TVOL_SCALE,该如何抑制这个警告?

产生错误的函数

def _decode_stock_quote(list_of_150_stk_str):
    """decode the webpage and return dataframe"""

    from cStringIO import StringIO

    str_of_all = "".join(list_of_150_stk_str)

    quote_df = pd.read_csv(StringIO(str_of_all), sep=',', names=list('ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefg')) #dtype={'A': object, 'B': object, 'C': np.float64}
    quote_df.rename(columns={'A':'STK', 'B':'TOpen', 'C':'TPCLOSE', 'D':'TPrice', 'E':'THigh', 'F':'TLow', 'I':'TVol', 'J':'TAmt', 'e':'TDate', 'f':'TTime'}, inplace=True)
    quote_df = quote_df.ix[:,[0,3,2,1,4,5,8,9,30,31]]
    quote_df['TClose'] = quote_df['TPrice']
    quote_df['RT']     = 100 * (quote_df['TPrice']/quote_df['TPCLOSE'] - 1)
    quote_df['TVol']   = quote_df['TVol']/TVOL_SCALE
    quote_df['TAmt']   = quote_df['TAmt']/TAMT_SCALE
    quote_df['STK_ID'] = quote_df['STK'].str.slice(13,19)
    quote_df['STK_Name'] = quote_df['STK'].str.slice(21,30)#.decode('gb2312')
    quote_df['TDate']  = quote_df.TDate.map(lambda x: x[0:4]+x[5:7]+x[8:10])

    return quote_df

更多错误讯息

E:\FinReporter\FM_EXT.py:449: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_index,col_indexer] = value instead
  quote_df['TVol']   = quote_df['TVol']/TVOL_SCALE
E:\FinReporter\FM_EXT.py:450: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_index,col_indexer] = value instead
  quote_df['TAmt']   = quote_df['TAmt']/TAMT_SCALE
E:\FinReporter\FM_EXT.py:453: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_index,col_indexer] = value instead
  quote_df['TDate']  = quote_df.TDate.map(lambda x: x[0:4]+x[5:7]+x[8:10])

Background

I just upgraded my Pandas from 0.11 to 0.13.0rc1. Now, the application is popping out many new warnings. One of them like this:

E:\FinReporter\FM_EXT.py:449: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_index,col_indexer] = value instead
  quote_df['TVol']   = quote_df['TVol']/TVOL_SCALE

I want to know what exactly it means? Do I need to change something?

How should I suspend the warning if I insist to use quote_df['TVol'] = quote_df['TVol']/TVOL_SCALE?

The function that gives errors

def _decode_stock_quote(list_of_150_stk_str):
    """decode the webpage and return dataframe"""

    from cStringIO import StringIO

    str_of_all = "".join(list_of_150_stk_str)

    quote_df = pd.read_csv(StringIO(str_of_all), sep=',', names=list('ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefg')) #dtype={'A': object, 'B': object, 'C': np.float64}
    quote_df.rename(columns={'A':'STK', 'B':'TOpen', 'C':'TPCLOSE', 'D':'TPrice', 'E':'THigh', 'F':'TLow', 'I':'TVol', 'J':'TAmt', 'e':'TDate', 'f':'TTime'}, inplace=True)
    quote_df = quote_df.ix[:,[0,3,2,1,4,5,8,9,30,31]]
    quote_df['TClose'] = quote_df['TPrice']
    quote_df['RT']     = 100 * (quote_df['TPrice']/quote_df['TPCLOSE'] - 1)
    quote_df['TVol']   = quote_df['TVol']/TVOL_SCALE
    quote_df['TAmt']   = quote_df['TAmt']/TAMT_SCALE
    quote_df['STK_ID'] = quote_df['STK'].str.slice(13,19)
    quote_df['STK_Name'] = quote_df['STK'].str.slice(21,30)#.decode('gb2312')
    quote_df['TDate']  = quote_df.TDate.map(lambda x: x[0:4]+x[5:7]+x[8:10])

    return quote_df

More error messages

E:\FinReporter\FM_EXT.py:449: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_index,col_indexer] = value instead
  quote_df['TVol']   = quote_df['TVol']/TVOL_SCALE
E:\FinReporter\FM_EXT.py:450: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_index,col_indexer] = value instead
  quote_df['TAmt']   = quote_df['TAmt']/TAMT_SCALE
E:\FinReporter\FM_EXT.py:453: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_index,col_indexer] = value instead
  quote_df['TDate']  = quote_df.TDate.map(lambda x: x[0:4]+x[5:7]+x[8:10])

回答 0

创建SettingWithCopyWarning是为了标记可能造成混淆的“链式”赋值,例如下面这种写法,它并不总是按预期工作,尤其是当第一步选择返回副本时。[有关背景讨论,请参见GH5390和GH5597。]

df[df['A'] > 2]['B'] = new_val  # new_val not set in df

该警告提出了如下重写建议:

df.loc[df['A'] > 2, 'B'] = new_val

但是,这不适合您的用法;您的用法相当于:

df = df[df['A'] > 2]
df['B'] = new_val

显然,您并不关心写操作是否会传回原始帧(因为您正在覆盖对它的引用),但不幸的是,这种模式无法与第一个链式赋值示例区分开,因此出现了(误报的)警告。如果您想进一步阅读,索引文档中讨论了误报的可能性。您可以通过以下赋值安全地禁用这一新警告。

import pandas as pd
pd.options.mode.chained_assignment = None  # default='warn'

The SettingWithCopyWarning was created to flag potentially confusing “chained” assignments, such as the following, which does not always work as expected, particularly when the first selection returns a copy. [see GH5390 and GH5597 for background discussion.]

df[df['A'] > 2]['B'] = new_val  # new_val not set in df

The warning offers a suggestion to rewrite as follows:

df.loc[df['A'] > 2, 'B'] = new_val

However, this doesn’t fit your usage, which is equivalent to:

df = df[df['A'] > 2]
df['B'] = new_val

While it’s clear that you don’t care about writes making it back to the original frame (since you are overwriting the reference to it), unfortunately this pattern cannot be differentiated from the first chained assignment example. Hence the (false positive) warning. The potential for false positives is addressed in the docs on indexing, if you’d like to read further. You can safely disable this new warning with the following assignment.

import pandas as pd
pd.options.mode.chained_assignment = None  # default='warn'
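
Alternatively, if you would rather keep the warning enabled globally, making the copy explicit removes the ambiguity for this particular pattern; a minimal sketch of the usage from the question:

df = df[df['A'] > 2].copy()  # explicitly a new frame, so later writes clearly target it
df['B'] = new_val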

回答 1

如何处理熊猫中的SettingWithCopyWarning?

这篇文章的读者对象是:

  1. 想了解此警告的含义
  2. 想了解抑制此警告的不同方法
  3. 想了解如何改进其代码并遵循良好做法,以避免将来出现此警告。

设定

np.random.seed(0)
df = pd.DataFrame(np.random.choice(10, (3, 5)), columns=list('ABCDE'))
df
   A  B  C  D  E
0  5  0  3  3  7
1  9  3  5  2  4
2  7  6  8  8  1

什么是SettingWithCopyWarning

要知道如何处理此警告,重要的是要理解它的含义以及为什么首先提出它。

过滤DataFrame时,对帧进行切片/索引可能返回视图,也可能返回副本,具体取决于内部布局和各种实现细节。顾名思义,“视图”是对原始数据的视图,因此修改视图可能会修改原始对象。另一方面,“副本”是对原始数据的复制,修改副本不会影响原始数据。

如其他答案所述,创建SettingWithCopyWarning是为了标记“链式赋值”操作。考虑上面设置中的df。假设您要选择“A”列中的值大于5的所有行对应的“B”列的值。Pandas允许您以不同的方式执行此操作,其中某些方式比其他方式更正确。例如,

df[df.A > 5]['B']

1    3
2    6
Name: B, dtype: int64

和,

df.loc[df.A > 5, 'B']

1    3
2    6
Name: B, dtype: int64

这些返回相同的结果,因此,如果您只是读取这些值,两者没有区别。那么,问题出在哪里?链式赋值的问题在于,通常很难预测返回的是视图还是副本,因此当您尝试把值赋回去时,这在很大程度上就成了问题。在前面示例的基础上,请考虑解释器如何执行这段代码:

df.loc[df.A > 5, 'B'] = 4
# becomes
df.__setitem__((df.A > 5, 'B'), 4)

这只对df进行一次__setitem__调用。另一方面,请考虑以下代码:

df[df.A > 5]['B'] = 4
# becomes
df.__getitem__(df.A > 5).__setitem__('B', 4)

现在,取决于__getitem__返回的是视图还是副本,__setitem__操作可能不起作用。

通常,您应使用loc进行基于标签的赋值,使用iloc进行基于整数/位置的赋值,因为规范保证它们始终在原始对象上操作。此外,要设置单个单元格,应使用at和iat。

可以在文档中找到更多信息

注意
所有使用loc进行的布尔索引操作也可以使用iloc进行。唯一的区别是,iloc对行索引期望整数/位置或布尔值的numpy数组,对列期望整数/位置索引。

例如,

df.loc[df.A > 5, 'B'] = 4

可以写成

df.iloc[(df.A > 5).values, 1] = 4

和,

df.loc[1, 'A'] = 100

可以写成

df.iloc[1, 0] = 100

等等。


告诉我如何抑制警告!

考虑对df的“A”列进行一个简单操作。选择“A”并除以2会发出警告,但操作本身会成功。

df2 = df[['A']]
df2['A'] /= 2
/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/IPython/__main__.py:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

df2
     A
0  2.5
1  4.5
2  3.5

有两种方法可以直接静默此警告:

  1. 做一个 deepcopy

    df2 = df[['A']].copy(deep=True)
    df2['A'] /= 2
  2. 更改pd.options.mode.chained_assignment
    可以设置为None、"warn"或"raise"。"warn"是默认值。None将完全抑制警告,"raise"则会抛出SettingWithCopyError,阻止操作执行。

    pd.options.mode.chained_assignment = None
    df2['A'] /= 2

@Peter Cotton在评论中提出了一种不错的方法:使用上下文管理器以非侵入的方式更改模式(根据此要点修改),仅在需要时设置模式,完成后再将其重置为原始状态。

class ChainedAssignment:
    def __init__(self, chained=None):
        acceptable = [None, 'warn', 'raise']
        assert chained in acceptable, "chained must be in " + str(acceptable)
        self.swcw = chained

    def __enter__(self):
        self.saved_swcw = pd.options.mode.chained_assignment
        pd.options.mode.chained_assignment = self.swcw
        return self

    def __exit__(self, *args):
        pd.options.mode.chained_assignment = self.saved_swcw

用法如下:

# some code here
with ChainedAssignment():
    df2['A'] /= 2
# more code follows

或者,引发异常

with ChainedAssignment(chained='raise'):
    df2['A'] /= 2

SettingWithCopyError: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

“ XY问题”:我在做什么错?

很多时候,用户试图寻找抑制此异常的方法,却没有完全理解为什么一开始会出现该异常。这是XY问题的一个很好的示例:用户尝试解决的问题“Y”,实际上是更深层问题“X”的症状。下面将根据遇到此警告的常见问题提出问题,然后给出解决方案。

问题1
我有一个DataFrame

df
       A  B  C  D  E
    0  5  0  3  3  7
    1  9  3  5  2  4
    2  7  6  8  8  1

我想将“A”列中大于5的值设为1000。我的预期输出是

      A  B  C  D  E
0     5  0  3  3  7
1  1000  3  5  2  4
2  1000  6  8  8  1

错误的方法:

df.A[df.A > 5] = 1000         # works, because df.A returns a view
df[df.A > 5]['A'] = 1000      # does not work
df.loc[df.A > 5]['A'] = 1000  # does not work

正确的方法是使用loc:

df.loc[df.A > 5, 'A'] = 1000


问题2¹
我正在尝试将单元格(1,’D’)中的值设置为12345。我的预期输出是

   A  B  C      D  E
0  5  0  3      3  7
1  9  3  5  12345  4
2  7  6  8      8  1

我尝试了多种访问此单元格的方法,例如 df['D'][1]。做这个的最好方式是什么?

1.这个问题与警告并不特别相关,但是最好了解如何正确执行此特定操作,以避免将来可能出现警告的情况。

您可以使用以下任何一种方法来执行此操作。

df.loc[1, 'D'] = 12345
df.iloc[1, 3] = 12345
df.at[1, 'D'] = 12345
df.iat[1, 3] = 12345


问题3
我试图根据某些条件对值进行子集化。我有一个DataFrame

   A  B  C  D  E
1  9  3  5  2  4
2  7  6  8  8  1

我想在“C” == 5的行中,将“D”中的值设为123。我尝试过

df2.loc[df2.C == 5, 'D'] = 123

看起来不错,但我仍然收到SettingWithCopyWarning!我该如何解决?

实际上,这很可能是由管道中更早的代码造成的。您是否是从更大的DataFrame创建的df2,例如

df2 = df[df.A > 5]

?在这种情况下,布尔索引会返回一个视图,因此df2会引用原始数据。您需要做的是把df2赋值为一个副本:

df2 = df[df.A > 5].copy()
# Or,
# df2 = df.loc[df.A > 5, :]


问题4
我试图从下面的DataFrame中就地删除列“C”:

   A  B  C  D  E
1  9  3  5  2  4
2  7  6  8  8  1

但是使用

df2.drop('C', axis=1, inplace=True)

抛出SettingWithCopyWarning。为什么会这样呢?

这是因为df2一定是由其他某个切片操作创建的视图,例如

df2 = df[df.A > 5]

这里的解决方案和前面一样:要么对df做copy(),要么使用loc。

How to deal with SettingWithCopyWarning in Pandas?

This post is meant for readers who,

  1. Would like to understand what this warning means
  2. Would like to understand different ways of suppressing this warning
  3. Would like to understand how to improve their code and follow good practices to avoid this warning in the future.

Setup

np.random.seed(0)
df = pd.DataFrame(np.random.choice(10, (3, 5)), columns=list('ABCDE'))
df
   A  B  C  D  E
0  5  0  3  3  7
1  9  3  5  2  4
2  7  6  8  8  1

What is the SettingWithCopyWarning?

To know how to deal with this warning, it is important to understand what it means and why it is raised in the first place.

When filtering DataFrames, it is possible to slice/index a frame to return either a view or a copy, depending on the internal layout and various implementation details. A “view” is, as the term suggests, a view into the original data, so modifying the view may modify the original object. On the other hand, a “copy” is a replication of data from the original, and modifying the copy has no effect on the original.

As mentioned by other answers, the SettingWithCopyWarning was created to flag “chained assignment” operations. Consider df in the setup above. Suppose you would like to select all values in column “B” where values in column “A” are greater than 5. Pandas allows you to do this in different ways, some more correct than others. For example,

df[df.A > 5]['B']

1    3
2    6
Name: B, dtype: int64

And,

df.loc[df.A > 5, 'B']

1    3
2    6
Name: B, dtype: int64

These return the same result, so if you are only reading these values, it makes no difference. So, what is the issue? The problem with chained assignment is that it is generally difficult to predict whether a view or a copy is returned, so this largely becomes an issue when you are attempting to assign values back. To build on the earlier example, consider how this code is executed by the interpreter:

df.loc[df.A > 5, 'B'] = 4
# becomes
df.__setitem__((df.A > 5, 'B'), 4)

This makes a single __setitem__ call on df. On the other hand, consider this code:

df[df.A > 5]['B'] = 4
# becomes
df.__getitem__(df.A > 5).__setitem__('B', 4)

Now, depending on whether __getitem__ returned a view or a copy, the __setitem__ operation may not work.
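
A quick sketch of the failure mode (the exact behavior can depend on the pandas version and the frame's internal layout, which is precisely the problem):

df[df.A > 5]['B'] = 4      # warns; the write typically lands on a temporary copy
df.loc[df.A > 5, 'B'] = 4  # a single __setitem__ on df; the write is guaranteed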

In general, you should use loc for label-based assignment, and iloc for integer/positional based assignment, as the spec guarantees that they always operate on the original. Additionally, for setting a single cell, you should use at and iat.

More can be found in the documentation.

Note
All boolean indexing operations done with loc can also be done with iloc. The only difference is that iloc expects either integers/positions for index or a numpy array of boolean values, and integer/position indexes for the columns.

For example,

df.loc[df.A > 5, 'B'] = 4

Can be written as

df.iloc[(df.A > 5).values, 1] = 4

And,

df.loc[1, 'A'] = 100

Can be written as

df.iloc[1, 0] = 100

And so on.
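
If you only know the column label, Index.get_loc (also used elsewhere on this page) converts it to the integer position that iloc needs:

df.iloc[(df.A > 5).values, df.columns.get_loc('B')] = 4  # same as df.loc[df.A > 5, 'B'] = 4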


Just tell me how to suppress the warning!

Consider a simple operation on the “A” column of df. Selecting “A” and dividing by 2 will raise the warning, but the operation will work.

df2 = df[['A']]
df2['A'] /= 2
/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/IPython/__main__.py:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

df2
     A
0  2.5
1  4.5
2  3.5

There are a couple ways of directly silencing this warning:

  1. Make a deepcopy

    df2 = df[['A']].copy(deep=True)
    df2['A'] /= 2
    
  2. Change pd.options.mode.chained_assignment
    Can be set to None, "warn", or "raise". "warn" is the default. None will suppress the warning entirely, and "raise" will throw a SettingWithCopyError, preventing the operation from going through.

    pd.options.mode.chained_assignment = None
    df2['A'] /= 2
    

@Peter Cotton, in the comments, came up with a nice way of non-intrusively changing the mode (modified from this gist) using a context manager: set the mode only for as long as it is required, then reset it back to the original state when finished.

class ChainedAssignment:
    def __init__(self, chained=None):
        acceptable = [None, 'warn', 'raise']
        assert chained in acceptable, "chained must be in " + str(acceptable)
        self.swcw = chained

    def __enter__(self):
        self.saved_swcw = pd.options.mode.chained_assignment
        pd.options.mode.chained_assignment = self.swcw
        return self

    def __exit__(self, *args):
        pd.options.mode.chained_assignment = self.saved_swcw

The usage is as follows:

# some code here
with ChainedAssignment():
    df2['A'] /= 2
# more code follows

Or, to raise the exception

with ChainedAssignment(chained='raise'):
    df2['A'] /= 2

SettingWithCopyError: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

The “XY Problem”: What am I doing wrong?

A lot of the time, users attempt to look for ways of suppressing this exception without fully understanding why it was raised in the first place. This is a good example of an XY problem, where users attempt to solve a problem “Y” that is actually a symptom of a deeper rooted problem “X”. Questions will be raised based on common problems that encounter this warning, and solutions will then be presented.

Question 1
I have a DataFrame

df
       A  B  C  D  E
    0  5  0  3  3  7
    1  9  3  5  2  4
    2  7  6  8  8  1

I want to set values in column “A” to 1000 wherever “A” > 5. My expected output is

      A  B  C  D  E
0     5  0  3  3  7
1  1000  3  5  2  4
2  1000  6  8  8  1

Wrong way to do this:

df.A[df.A > 5] = 1000         # works, because df.A returns a view
df[df.A > 5]['A'] = 1000      # does not work
df.loc[df.A > 5]['A'] = 1000  # does not work

Right way using loc:

df.loc[df.A > 5, 'A'] = 1000


Question 2¹
I am trying to set the value in cell (1, ‘D’) to 12345. My expected output is

   A  B  C      D  E
0  5  0  3      3  7
1  9  3  5  12345  4
2  7  6  8      8  1

I have tried different ways of accessing this cell, such as df['D'][1]. What is the best way to do this?

1. This question isn’t specifically related to the warning, but it is good to understand how to do this particular operation correctly so as to avoid situations where the warning could potentially arise in future.

You can use any of the following methods to do this.

df.loc[1, 'D'] = 12345
df.iloc[1, 3] = 12345
df.at[1, 'D'] = 12345
df.iat[1, 3] = 12345


Question 3
I am trying to subset values based on some condition. I have a DataFrame

   A  B  C  D  E
1  9  3  5  2  4
2  7  6  8  8  1

I would like to set values in “D” to 123 wherever “C” == 5. I tried

df2.loc[df2.C == 5, 'D'] = 123

Which seems fine but I am still getting the SettingWithCopyWarning! How do I fix this?

This is actually probably because of code higher up in your pipeline. Did you create df2 from something larger, like

df2 = df[df.A > 5]

? In this case, boolean indexing will return a view, so df2 will reference the original. What you’d need to do is assign df2 to a copy:

df2 = df[df.A > 5].copy()
# Or,
# df2 = df.loc[df.A > 5, :]


Question 4
I’m trying to drop column “C” in-place from

   A  B  C  D  E
1  9  3  5  2  4
2  7  6  8  8  1

But using

df2.drop('C', axis=1, inplace=True)

Throws SettingWithCopyWarning. Why is this happening?

This is because df2 must have been created as a view from some other slicing operation, such as

df2 = df[df.A > 5]

The solution here is to either make a copy() of df, or use loc, as before.
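
Putting that together, a small sketch of the warning-free version:

df2 = df[df.A > 5].copy()            # df2 now owns its data
df2.drop('C', axis=1, inplace=True)  # no SettingWithCopyWarning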


回答 2

通常,SettingWithCopyWarning的目的是向用户(尤其是新用户)表明,他们可能正在操作副本,而不是他们以为的原始对象。其中存在误报(换句话说,如果您清楚自己在做什么,可能没有问题)。一种可能性是按照@Garrett的建议,简单地关闭这个(默认为warn的)警告。

这是另一个选择:

In [1]: df = DataFrame(np.random.randn(5, 2), columns=list('AB'))

In [2]: dfa = df.ix[:, [1, 0]]

In [3]: dfa.is_copy
Out[3]: True

In [4]: dfa['A'] /= 2
/usr/local/bin/ipython:1: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_index,col_indexer] = value instead
  #!/usr/local/bin/python

您可以将is_copy标志设置为False,以有效关闭该对象的检查:

In [5]: dfa.is_copy = False

In [6]: dfa['A'] /= 2

如果您明确复制,则不会发生进一步的警告:

In [7]: dfa = df.ix[:, [1, 0]].copy()

In [8]: dfa['A'] /= 2

OP在上面展示的代码虽然合法,而且可能我自己也会这样写,但从技术上讲,它确实属于此警告针对的情况,而不是误报。另一种避免警告的方法是通过reindex进行选择操作,例如

quote_df = quote_df.reindex(columns=['STK', ...])

要么,

quote_df = quote_df.reindex(['STK', ...], axis=1)  # v.0.21

In general the point of the SettingWithCopyWarning is to show users (and especially new users) that they may be operating on a copy and not the original as they think. There are false positives (IOW if you know what you are doing it could be ok). One possibility is simply to turn off the (by default warn) warning as @Garrett suggests.

Here is another option:

In [1]: df = DataFrame(np.random.randn(5, 2), columns=list('AB'))

In [2]: dfa = df.ix[:, [1, 0]]

In [3]: dfa.is_copy
Out[3]: True

In [4]: dfa['A'] /= 2
/usr/local/bin/ipython:1: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_index,col_indexer] = value instead
  #!/usr/local/bin/python

You can set the is_copy flag to False, which will effectively turn off the check, for that object:

In [5]: dfa.is_copy = False

In [6]: dfa['A'] /= 2

If you explicitly copy then no further warning will happen:

In [7]: dfa = df.ix[:, [1, 0]].copy()

In [8]: dfa['A'] /= 2

The code the OP is showing above, while legitimate, and probably something I do as well, is technically a case for this warning, and not a false positive. Another way to not have the warning would be to do the selection operation via reindex, e.g.

quote_df = quote_df.reindex(columns=['STK', ...])

Or,

quote_df = quote_df.reindex(['STK', ...], axis=1)  # v.0.21
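
To make the idea concrete on a toy frame (the column names here are illustrative, not taken from the question):

df = pd.DataFrame({'A': [1, 2], 'B': [3, 4], 'C': [5, 6]})
sub = df.reindex(columns=['B', 'A'])  # reindex always returns a new object
sub['B'] = sub['B'] * 2               # assigning into sub should not warn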

回答 3

熊猫数据框复制警告

当您去做这样的事情时:

quote_df = quote_df.ix[:,[0,3,2,1,4,5,8,9,30,31]]

pandas.ix 在这种情况下将返回一个新的独立数据帧。

您决定在此数据框中更改的任何值都不会更改原始数据框。

这就是熊猫试图警告您的内容。


为什么 .ix是个坏主意

.ix对象试图做的事情不止一件,对于任何读过整洁代码相关内容的人来说,这是一种强烈的坏味道。

给定此数据框:

df = pd.DataFrame({"a": [1,2,3,4], "b": [1,1,2,2]})

两种行为:

dfcopy = df.ix[:,["a"]]
dfcopy.a.ix[0] = 2

行为一:dfcopy现在是一个独立的数据框。改变它不会改变df

df.ix[0, "a"] = 3

行为二:更改原始数据框。


使用.loc替代

熊猫开发者(据推测)意识到了.ix对象的坏味道,因此创建了两个新对象来帮助获取和赋值数据。(另一个是.iloc。)

.loc 速度更快,因为它不会尝试创建数据副本。

.loc 旨在就地修改现有数据框,从而提高内存效率。

.loc 是可预测的,它具有一种行为。


解决方案

在代码示例中,您正在执行的操作是加载一个包含许多列的大文件,然后将其修改为较小的文件。

pd.read_csv函数可以帮助您解决很多这样的问题,还可以加快文件的加载速度。

所以不要这样做

quote_df = pd.read_csv(StringIO(str_of_all), sep=',', names=list('ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefg')) #dtype={'A': object, 'B': object, 'C': np.float64}
quote_df.rename(columns={'A':'STK', 'B':'TOpen', 'C':'TPCLOSE', 'D':'TPrice', 'E':'THigh', 'F':'TLow', 'I':'TVol', 'J':'TAmt', 'e':'TDate', 'f':'TTime'}, inplace=True)
quote_df = quote_df.ix[:,[0,3,2,1,4,5,8,9,30,31]]

做这个

columns = ['STK', 'TPrice', 'TPCLOSE', 'TOpen', 'THigh', 'TLow', 'TVol', 'TAmt', 'TDate', 'TTime']
df = pd.read_csv(StringIO(str_of_all), sep=',', usecols=[0,3,2,1,4,5,8,9,30,31])
df.columns = columns

这只会读取您感兴趣的列,并正确地命名它们。无需使用邪恶的.ix对象来做神奇的事情。

Pandas dataframe copy warning

When you go and do something like this:

quote_df = quote_df.ix[:,[0,3,2,1,4,5,8,9,30,31]]

pandas.ix in this case returns a new, stand alone dataframe.

Any values you decide to change in this dataframe, will not change the original dataframe.

This is what pandas tries to warn you about.


Why .ix is a bad idea

The .ix object tries to do more than one thing, and for anyone who has read anything about clean code, this is a strong smell.

Given this dataframe:

df = pd.DataFrame({"a": [1,2,3,4], "b": [1,1,2,2]})

Two behaviors:

dfcopy = df.ix[:,["a"]]
dfcopy.a.ix[0] = 2

Behavior one: dfcopy is now a stand alone dataframe. Changing it will not change df

df.ix[0, "a"] = 3

Behavior two: This changes the original dataframe.


Use .loc instead

The pandas developers recognized that the .ix object was quite smelly [speculatively] and thus created two new objects which help with accessing and assigning data. (The other being .iloc.)

.loc is faster, because it does not try to create a copy of the data.

.loc is meant to modify your existing dataframe inplace, which is more memory efficient.

.loc is predictable, it has one behavior.


The solution

What you are doing in your code example is loading a big file with lots of columns, then modifying it to be smaller.

The pd.read_csv function can help you out with a lot of this and also make the loading of the file a lot faster.

So instead of doing this

quote_df = pd.read_csv(StringIO(str_of_all), sep=',', names=list('ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefg')) #dtype={'A': object, 'B': object, 'C': np.float64}
quote_df.rename(columns={'A':'STK', 'B':'TOpen', 'C':'TPCLOSE', 'D':'TPrice', 'E':'THigh', 'F':'TLow', 'I':'TVol', 'J':'TAmt', 'e':'TDate', 'f':'TTime'}, inplace=True)
quote_df = quote_df.ix[:,[0,3,2,1,4,5,8,9,30,31]]

Do this

columns = ['STK', 'TPrice', 'TPCLOSE', 'TOpen', 'THigh', 'TLow', 'TVol', 'TAmt', 'TDate', 'TTime']
df = pd.read_csv(StringIO(str_of_all), sep=',', usecols=[0,3,2,1,4,5,8,9,30,31])
df.columns = columns

This will only read the columns you are interested in, and name them properly. No need for using the evil .ix object to do magical stuff.


回答 4

在这里,我直接回答这个问题。怎么处理呢?

切片后做一个.copy(deep=False)。参见pandas.DataFrame.copy。

等等,切片不是会返回副本吗?毕竟,这不正是警告消息想说的吗?请阅读详细解答:

import pandas as pd
df = pd.DataFrame({'x':[1,2,3]})

这给出了警告:

df0 = df[df.x>2]
df0['foo'] = 'bar'

这不是:

df1 = df[df.x>2].copy(deep=False)
df1['foo'] = 'bar'

两者df0df1都是DataFrame对象,但它们之间的某些不同之处使熊猫能够打印警告。让我们找出它是什么。

import inspect
slice = df[df.x>2]
slice_copy = df[df.x>2].copy(deep=False)
inspect.getmembers(slice)
inspect.getmembers(slice_copy)

使用选择的差异工具,您将看到,除了几个地址之外,唯一的实质区别是:

|          | slice   | slice_copy |
| _is_copy | weakref | None       |

决定是否发出警告的方法是DataFrame._check_setitem_copy,它会检查_is_copy。所以就是这样:制作一个副本,使您的DataFrame不再带有_is_copy。

警告建议使用.loc,但如果在带有_is_copy的帧上使用.loc,您仍会收到相同的警告。误导吗?是的。烦人吗?当然。有帮助吗?可能有,在使用链式赋值时。但它无法准确检测链式赋值,会不加区分地打印警告。

Here I answer the question directly. How to deal with it?

Make a .copy(deep=False) after you slice. See pandas.DataFrame.copy.

Wait, doesn’t a slice return a copy? After all, this is what the warning message is attempting to say? Read the long answer:

import pandas as pd
df = pd.DataFrame({'x':[1,2,3]})

This gives a warning:

df0 = df[df.x>2]
df0['foo'] = 'bar'

This does not:

df1 = df[df.x>2].copy(deep=False)
df1['foo'] = 'bar'

Both df0 and df1 are DataFrame objects, but something about them is different that enables pandas to print the warning. Let’s find out what it is.

import inspect
slice = df[df.x>2]
slice_copy = df[df.x>2].copy(deep=False)
inspect.getmembers(slice)
inspect.getmembers(slice_copy)

Using your diff tool of choice, you will see that beyond a couple of addresses, the only material difference is this:

|          | slice   | slice_copy |
| _is_copy | weakref | None       |

The method that decides whether to warn is DataFrame._check_setitem_copy, which checks _is_copy. So there you go: make a copy so that your DataFrame does not have _is_copy set.
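
To see the flag directly (keeping in mind that _is_copy is a private attribute and may change between pandas versions):

print(df[df.x > 2]._is_copy)                   # <weakref ...>: the check is armed
print(df[df.x > 2].copy(deep=False)._is_copy)  # None: no warning will fire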

The warning is suggesting to use .loc, but if you use .loc on a frame that _is_copy, you will still get the same warning. Misleading? Yes. Annoying? You bet. Helpful? Potentially, when chained assignment is used. But it cannot correctly detect chain assignment and prints the warning indiscriminately.


回答 5

在Pandas中,这个话题确实令人困惑。幸运的是,它有一个相对简单的解决方案。

问题在于,并不总是清楚数据过滤操作(例如loc)是否返回DataFrame的副本或视图。因此,这种过滤后的DataFrame的进一步使用可能会造成混淆。

简单的解决方案是(除非您需要处理非常大的数据集):

每当需要更新任何值时,请始终确保在赋值之前显式复制DataFrame。

df  # Some DataFrame
df = df.loc[:, 0:2]  # Some filtering (unsure whether a view or copy is returned)
df = df.copy()  # Ensuring a copy is made
df[df["Name"] == "John"] = "Johny"  # Assignment can be done now (no warning)

This topic is really confusing with Pandas. Luckily, it has a relatively simple solution.

The problem is that it is not always clear whether data filtering operations (e.g. loc) return a copy or a view of the DataFrame. Further use of such filtered DataFrame could therefore be confusing.

The simple solution is (unless you need to work with very large sets of data):

Whenever you need to update any values, always make sure that you explicitly copy the DataFrame before the assignment.

df  # Some DataFrame
df = df.loc[:, 0:2]  # Some filtering (unsure whether a view or copy is returned)
df = df.copy()  # Ensuring a copy is made
df[df["Name"] == "John"] = "Johny"  # Assignment can be done now (no warning)


回答 6

为了消除任何疑问,我的解决方案是制作切片的深层副本,而不是常规副本。根据您的上下文,这可能并不适用(内存限制/切片的大小、潜在的性能下降,特别是当复制像我这样发生在循环中时,等等)。

需要明确的是,这是我收到的警告:

/opt/anaconda3/lib/python3.6/site-packages/ipykernel/__main__.py:54:
SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation:
http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy

插图

我怀疑警告是因为我在切片的副本上删除了一列而引发的。虽然从技术上讲,这并不是在切片副本中设置值,但它仍然是对切片副本的修改。以下是我为确认这一怀疑所采取的(简化)步骤,希望它能帮助那些试图理解此警告的人。

示例1:在原件上删除一列会影响副本

我们已经知道这一点了,但这是一个有益的提醒。这不是警告所针对的内容。

>> data1 = {'A': [111, 112, 113], 'B':[121, 122, 123]}
>> df1 = pd.DataFrame(data1)
>> df1

    A   B
0   111 121
1   112 122
2   113 123


>> df2 = df1
>> df2

A   B
0   111 121
1   112 122
2   113 123

# Dropping a column on df1 affects df2
>> df1.drop('A', axis=1, inplace=True)
>> df2
    B
0   121
1   122
2   123

可以避免对df1的更改影响df2:

>> data1 = {'A': [111, 112, 113], 'B':[121, 122, 123]}
>> df1 = pd.DataFrame(data1)
>> df1

A   B
0   111 121
1   112 122
2   113 123

>> import copy
>> df2 = copy.deepcopy(df1)
>> df2
A   B
0   111 121
1   112 122
2   113 123

# Dropping a column on df1 does not affect df2
>> df1.drop('A', axis=1, inplace=True)
>> df2
    A   B
0   111 121
1   112 122
2   113 123

示例2:在副本上删除一列可能会影响原件

这实际上正说明了该警告的含义。

>> data1 = {'A': [111, 112, 113], 'B':[121, 122, 123]}
>> df1 = pd.DataFrame(data1)
>> df1

    A   B
0   111 121
1   112 122
2   113 123

>> df2 = df1
>> df2

    A   B
0   111 121
1   112 122
2   113 123

# Dropping a column on df2 can affect df1
# No slice involved here, but I believe the principle remains the same?
# Let me know if not
>> df2.drop('A', axis=1, inplace=True)
>> df1

B
0   121
1   122
2   123

可以避免对df2的更改影响df1:

>> data1 = {'A': [111, 112, 113], 'B':[121, 122, 123]}
>> df1 = pd.DataFrame(data1)
>> df1

    A   B
0   111 121
1   112 122
2   113 123

>> import copy
>> df2 = copy.deepcopy(df1)
>> df2

A   B
0   111 121
1   112 122
2   113 123

>> df2.drop('A', axis=1, inplace=True)
>> df1

A   B
0   111 121
1   112 122
2   113 123

干杯!

To remove any doubt, my solution was to make a deep copy of the slice instead of a regular copy. This may not be applicable depending on your context (Memory constraints / size of the slice, potential for performance degradation – especially if the copy occurs in a loop like it did for me, etc…)

To be clear, here is the warning I received:

/opt/anaconda3/lib/python3.6/site-packages/ipykernel/__main__.py:54:
SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation:
http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy

Illustration

I had doubts that the warning was thrown because of a column I was dropping on a copy of the slice. While not technically trying to set a value in the copy of the slice, that was still a modification of the copy of the slice. Below are the (simplified) steps I have taken to confirm the suspicion; I hope it will help those of us who are trying to understand the warning.

Example 1: dropping a column on the original affects the copy

We knew that already but this is a healthy reminder. This is NOT what the warning is about.

>> data1 = {'A': [111, 112, 113], 'B':[121, 122, 123]}
>> df1 = pd.DataFrame(data1)
>> df1

    A   B
0   111 121
1   112 122
2   113 123


>> df2 = df1
>> df2

A   B
0   111 121
1   112 122
2   113 123

# Dropping a column on df1 affects df2
>> df1.drop('A', axis=1, inplace=True)
>> df2
    B
0   121
1   122
2   123

It is possible to prevent changes made on df1 from affecting df2

>> data1 = {'A': [111, 112, 113], 'B':[121, 122, 123]}
>> df1 = pd.DataFrame(data1)
>> df1

A   B
0   111 121
1   112 122
2   113 123

>> import copy
>> df2 = copy.deepcopy(df1)
>> df2
A   B
0   111 121
1   112 122
2   113 123

# Dropping a column on df1 does not affect df2
>> df1.drop('A', axis=1, inplace=True)
>> df2
    A   B
0   111 121
1   112 122
2   113 123

Example 2: dropping a column on the copy may affect the original

This actually illustrates the warning.

>> data1 = {'A': [111, 112, 113], 'B':[121, 122, 123]}
>> df1 = pd.DataFrame(data1)
>> df1

    A   B
0   111 121
1   112 122
2   113 123

>> df2 = df1
>> df2

    A   B
0   111 121
1   112 122
2   113 123

# Dropping a column on df2 can affect df1
# No slice involved here, but I believe the principle remains the same?
# Let me know if not
>> df2.drop('A', axis=1, inplace=True)
>> df1

B
0   121
1   122
2   123

It is possible to prevent changes made on df2 from affecting df1

>> data1 = {'A': [111, 112, 113], 'B':[121, 122, 123]}
>> df1 = pd.DataFrame(data1)
>> df1

    A   B
0   111 121
1   112 122
2   113 123

>> import copy
>> df2 = copy.deepcopy(df1)
>> df2

A   B
0   111 121
1   112 122
2   113 123

>> df2.drop('A', axis=1, inplace=True)
>> df1

A   B
0   111 121
1   112 122
2   113 123

Cheers!


回答 7

这应该有效:

quote_df.loc[:,'TVol'] = quote_df['TVol']/TVOL_SCALE

This should work:

quote_df.loc[:,'TVol'] = quote_df['TVol']/TVOL_SCALE

回答 8

有些人可能想简单地消除警告:

class SuppressSettingWithCopyWarning:
    def __enter__(self):
        pd.options.mode.chained_assignment = None

    def __exit__(self, *args):
        pd.options.mode.chained_assignment = 'warn'  # restores the default, not a previously saved value

with SuppressSettingWithCopyWarning():
    ...  # code that produces the warning

Some may want to simply suppress the warning:

class SuppressSettingWithCopyWarning:
    def __enter__(self):
        pd.options.mode.chained_assignment = None

    def __exit__(self, *args):
        pd.options.mode.chained_assignment = 'warn'  # restores the default, not a previously saved value

with SuppressSettingWithCopyWarning():
    ...  # code that produces the warning

回答 9

如果您已将切片分配给变量,并希望使用变量进行设置,如下所示:

df2 = df[df['A'] > 2]
df2['B'] = value

而且,如果由于计算df2的条件太长或其他原因,您不想使用Jeff的解决方案,那么您可以使用以下方法:

df.loc[df2.index.tolist(), 'B'] = value

df2.index.tolist() 返回df2中所有条目的索引,然后将这些索引用于设置原始数据帧中的B列。

If you have assigned the slice to a variable and want to set using the variable as in the following:

df2 = df[df['A'] > 2]
df2['B'] = value

And if you do not want to use Jeff's solution, because the condition computing df2 is too long or for some other reason, then you can use the following:

df.loc[df2.index.tolist(), 'B'] = value

df2.index.tolist() returns the indices from all entries in df2, which will then be used to set column B in the original dataframe.
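
A small aside: the .tolist() conversion should not be necessary, since .loc also accepts an Index object directly:

df.loc[df2.index, 'B'] = value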


回答 10

对我来说,此问题出现在下面的简化示例中。我也得以解决它(希望这是一个正确的解决方案):

带有警告的旧代码:

def update_old_dataframe(old_dataframe, new_dataframe):
    for new_index, new_row in new_dataframe.iterrows():
        old_dataframe.loc[new_index] = update_row(old_dataframe.loc[new_index], new_row)

def update_row(old_row, new_row):
    for field in list_of_columns:
        # line with warning because of chain indexing old_dataframe[new_index][field]
        old_row[field] = new_row[field]
    return old_row

这会在old_row[field] = new_row[field]这一行打印警告。

由于update_row方法中的行实际上是Series类型,因此我将该行替换为:

old_row.at[field] = new_row.at[field]

即用于访问/查找Series的方法。尽管两种写法都能正常工作且结果相同,但通过这种方式,我不必禁用警告(= 把警告留给其他地方的链式索引问题)。

我希望这可以帮助某人。

For me this issue occured in a following >simplified< example. And I was also able to solve it (hopefully with a correct solution):

old code with warning:

def update_old_dataframe(old_dataframe, new_dataframe):
    for new_index, new_row in new_dataframe.iterrows():
        old_dataframe.loc[new_index] = update_row(old_dataframe.loc[new_index], new_row)

def update_row(old_row, new_row):
    for field in list_of_columns:
        # line with warning because of chain indexing old_dataframe[new_index][field]
        old_row[field] = new_row[field]
    return old_row

This printed the warning for the line old_row[field] = new_row[field]

Since the rows in update_row method are actually type Series, I replaced the line with:

old_row.at[field] = new_row.at[field]

i.e. the method for access/lookups on a Series. Even though both work just fine and the result is the same, this way I don’t have to disable the warnings (= I keep them for other chained indexing issues elsewhere).

I hope this may help someone.


回答 11

我相信您可以避免像这样的整个问题:

return (
    pd.read_csv(StringIO(str_of_all), sep=',', names=list('ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefg')) #dtype={'A': object, 'B': object, 'C': np.float64}
    .rename(columns={'A':'STK', 'B':'TOpen', 'C':'TPCLOSE', 'D':'TPrice', 'E':'THigh', 'F':'TLow', 'I':'TVol', 'J':'TAmt', 'e':'TDate', 'f':'TTime'})  # no inplace=True, which would return None and break the chain
    .ix[:,[0,3,2,1,4,5,8,9,30,31]]
    .assign(
        TClose=lambda df: df['TPrice'],
        RT=lambda df: 100 * (df['TPrice']/df['TPCLOSE'] - 1),  # df, not quote_df: inside the chain there is no quote_df
        TVol=lambda df: df['TVol']/TVOL_SCALE,
        TAmt=lambda df: df['TAmt']/TAMT_SCALE,
        STK_ID=lambda df: df['STK'].str.slice(13,19),
        STK_Name=lambda df: df['STK'].str.slice(21,30),  # .decode('gb2312')
        TDate=lambda df: df.TDate.map(lambda x: x[0:4]+x[5:7]+x[8:10]),
    )
)

使用assign。引用文档:将新列分配给DataFrame,返回一个新对象(副本),其中除新列外还包含所有原始列。

参见汤姆·奥格斯珀格(Tom Augspurger)关于熊猫方法链的文章:https://tomaugspurger.github.io/method-chaining

You could avoid the whole problem like this, I believe:

return (
    pd.read_csv(StringIO(str_of_all), sep=',', names=list('ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefg')) #dtype={'A': object, 'B': object, 'C': np.float64}
    .rename(columns={'A':'STK', 'B':'TOpen', 'C':'TPCLOSE', 'D':'TPrice', 'E':'THigh', 'F':'TLow', 'I':'TVol', 'J':'TAmt', 'e':'TDate', 'f':'TTime'})  # no inplace=True, which would return None and break the chain
    .ix[:,[0,3,2,1,4,5,8,9,30,31]]
    .assign(
        TClose=lambda df: df['TPrice'],
        RT=lambda df: 100 * (df['TPrice']/df['TPCLOSE'] - 1),  # df, not quote_df: inside the chain there is no quote_df
        TVol=lambda df: df['TVol']/TVOL_SCALE,
        TAmt=lambda df: df['TAmt']/TAMT_SCALE,
        STK_ID=lambda df: df['STK'].str.slice(13,19),
        STK_Name=lambda df: df['STK'].str.slice(21,30),  # .decode('gb2312')
        TDate=lambda df: df.TDate.map(lambda x: x[0:4]+x[5:7]+x[8:10]),
    )
)

Using assign. From the documentation: Assign new columns to a DataFrame, returning a new object (a copy) with all the original columns in addition to the new ones.

See Tom Augspurger’s article on method chaining in pandas: https://tomaugspurger.github.io/method-chaining


回答 12

后续初学者问题/备注

也许可以为其他像我这样的初学者做个澄清(我来自R,它的底层工作方式似乎有所不同)。以下看起来无害且能正常运行的代码不断产生SettingWithCopy警告,而我不知道为什么。我已经阅读并理解了关于“链式索引”的问题,但我的代码中并不包含任何链式索引:

def plot(pdb, df, title, **kw):
    df['target'] = (df['ogg'] + df['ugg']) / 2
    # ...

但是后来(已经太晚了),我查看了plot()函数的调用位置:

    df = data[data['anz_emw'] > 0]
    pixbuf = plot(pdb, df, title)

因此,“df”不是一个数据帧,而是某种以某种方式记得自己是通过索引数据帧而创建的对象(那么它是视图吗?),这将使plot()中的这一行

 df['target'] = ...

相当于

 data[data['anz_emw'] > 0]['target'] = ...

这就是链式索引。我理解对了吗?

无论如何,

def plot(pdb, df, title, **kw):
    df.loc[:,'target'] = (df['ogg'] + df['ugg']) / 2

就解决了这个问题。

Followup beginner question / remark

Maybe a clarification for other beginners like me (I come from R, which seems to work a bit differently under the hood). The following harmless-looking and functional code kept producing the SettingWithCopy warning, and I couldn’t figure out why. I had both read and understood the issues with “chained indexing”, but my code doesn’t contain any:

def plot(pdb, df, title, **kw):
    df['target'] = (df['ogg'] + df['ugg']) / 2
    # ...

But then, later, much too late, I looked at where the plot() function is called:

    df = data[data['anz_emw'] > 0]
    pixbuf = plot(pdb, df, title)

So “df” isn’t a data frame but an object that somehow remembers that it was created by indexing a data frame (so is that a view?) which would make the line in plot()

 df['target'] = ...

equivalent to

 data[data['anz_emw'] > 0]['target'] = ...

which is chained indexing. Did I get that right?

Anyway,

def plot(pdb, df, title, **kw):
    df.loc[:,'target'] = (df['ogg'] + df['ugg']) / 2

fixed it.


回答 13

由于这个问题已经在现有答案中得到了充分的解释和讨论,我只想提供一种简洁的pandas方式来实现上下文管理器,即使用pandas.option_context(见文档和示例的链接),完全不需要创建带有各种dunder方法和花哨功能的自定义类。

首先,上下文管理器代码本身:

from contextlib import contextmanager

@contextmanager
def SuppressPandasWarning():
    with pd.option_context("mode.chained_assignment", None):
        yield

然后是一个示例:

import pandas as pd
from string import ascii_letters

a = pd.DataFrame({"A": list(ascii_letters[0:4]), "B": range(0,4)})

mask = a["A"].isin(["c", "d"])
# Even shallow copy below is enough to not raise the warning, but why is a mystery to me.
b = a.loc[mask]  # .copy(deep=False)

# Raises the `SettingWithCopyWarning`
b["B"] = b["B"] * 2

# Does not!
with SuppressPandasWarning():
    b["B"] = b["B"] * 2

值得一提的是,这两种方法均未修改a,这让我有点惊讶;即使是使用.copy(deep=False)的浅副本也足以阻止发出此警告(据我所知,浅副本至少应该也会修改a,但它没有。熊猫魔术。)。

As this question is already fully explained and discussed in existing answers I will just provide a neat pandas approach to the context manager using pandas.option_context (links to docs and example) – there is absolutely no need to create a custom class with all the dunder methods and other bells and whistles.

First the context manager code itself:

from contextlib import contextmanager

@contextmanager
def SuppressPandasWarning():
    with pd.option_context("mode.chained_assignment", None):
        yield

Then an example:

import pandas as pd
from string import ascii_letters

a = pd.DataFrame({"A": list(ascii_letters[0:4]), "B": range(0,4)})

mask = a["A"].isin(["c", "d"])
# Even shallow copy below is enough to not raise the warning, but why is a mystery to me.
b = a.loc[mask]  # .copy(deep=False)

# Raises the `SettingWithCopyWarning`
b["B"] = b["B"] * 2

# Does not!
with SuppressPandasWarning():
    b["B"] = b["B"] * 2

Worth noticing is that both approaches do not modify a, which is a bit surprising to me, and even a shallow df copy with .copy(deep=False) would prevent this warning from being raised (as far as I understand a shallow copy should at least modify a as well, but it doesn’t. pandas magic.).
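
If even the small wrapper feels like too much, pd.option_context is itself a context manager and can be used inline:

with pd.option_context("mode.chained_assignment", None):
    b["B"] = b["B"] * 2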


回答 14

当我从一个已用.query()方法过滤过的现有数据帧赋值新数据帧,并使用.apply()时,我一直遇到这个问题。例如:

prop_df = df.query('column == "value"')
prop_df['new_column'] = prop_df.apply(function, axis=1)

将返回此错误。在这种情况下,似乎可以解决该错误的做法是将其改为:

prop_df = df.copy(deep=True)
prop_df = prop_df.query('column == "value"')
prop_df['new_column'] = prop_df.apply(function, axis=1)

但是,由于必须进行新的复制,因此这在使用大型数据帧时效率不高。

如果您使用.apply()方法生成新列及其值,可以通过添加.reset_index(drop=True)来解决该错误并提高效率:

prop_df = df.query('column == "value"').reset_index(drop=True)
prop_df['new_column'] = prop_df.apply(function, axis=1)

I had been getting this issue with .apply() when assigning a new dataframe from a pre-existing dataframe on which I’ve used the .query() method. For instance:

prop_df = df.query('column == "value"')
prop_df['new_column'] = prop_df.apply(function, axis=1)

Would return this error. The fix that seems to resolve the error in this case is changing this to:

prop_df = df.copy(deep=True)
prop_df = prop_df.query('column == "value"')
prop_df['new_column'] = prop_df.apply(function, axis=1)

However, this is NOT efficient especially when using large dataframes, due to having to make a new copy.

If you’re using the .apply() method to generate a new column and its values, a fix that resolves the error and is more efficient is adding .reset_index(drop=True):

prop_df = df.query('column == "value"').reset_index(drop=True)
prop_df['new_column'] = prop_df.apply(function, axis=1)
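
A variant along the same lines that should also silence the warning, while copying only the filtered rows rather than the whole frame up front:

prop_df = df.query('column == "value"').copy()
prop_df['new_column'] = prop_df.apply(function, axis=1)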

如何扩展输出显示以查看pandas DataFrame的更多列?

问题:如何扩展输出显示以查看pandas DataFrame的更多列?

有没有办法在交互式或脚本执行模式下扩大输出的显示?

具体来说,我在pandas DataFrame上使用describe()函数。当DataFrame为5列(标签)宽时,我能得到所需的描述性统计信息。但是,如果DataFrame的列更多,统计信息就会被省略,并返回如下内容:

>> Index: 8 entries, count to max  
>> Data columns:  
>> x1          8  non-null values  
>> x2          8  non-null values  
>> x3          8  non-null values  
>> x4          8  non-null values  
>> x5          8  non-null values  
>> x6          8  non-null values  
>> x7          8  non-null values  

无论是6列还是7列,给出的都是“8”这个值。“8”指的是什么?

我已经尝试过把IDLE窗口拖得更大,并增大“Configure IDLE”的宽度选项,但无济于事。

我使用熊猫和describe()的目的是避免使用诸如Stata之类的第二个程序来进行基本的数据操作和调查。

Is there a way to widen the display of output in either interactive or script-execution mode?

Specifically, I am using the describe() function on a pandas DataFrame. When the DataFrame is 5 columns (labels) wide, I get the descriptive statistics that I want. However, if the DataFrame has any more columns, the statistics are suppressed and something like this is returned:

>> Index: 8 entries, count to max  
>> Data columns:  
>> x1          8  non-null values  
>> x2          8  non-null values  
>> x3          8  non-null values  
>> x4          8  non-null values  
>> x5          8  non-null values  
>> x6          8  non-null values  
>> x7          8  non-null values  

The “8” value is given whether there are 6 or 7 columns. What does the “8” refer to?

I have already tried dragging the IDLE window larger, as well as increasing the “Configure IDLE” width options, to no avail.

My purpose in using pandas and describe() is to avoid using a second program like Stata to do basic data manipulation and investigation.


回答 0

更新:熊猫0.23.4及更高版本

这不再是必需的:如果设置pd.options.display.width = 0,pandas会自动检测终端窗口的大小。(有关较旧的版本,请参阅底部。)

pandas.set_printoptions(...)已不推荐使用。请改用pandas.set_option(optname, val),或等效地使用pd.options.<opt.hierarchical.name> = val。例如:

import pandas as pd
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)

这是set_option的帮助信息:

set_option(pat,value)-设置指定选项的值

可用选项:
display.[chop_threshold, colheader_justify, column_space, date_dayfirst,
         date_yearfirst, encoding, expand_frame_repr, float_format, height,
         line_width, max_columns, max_colwidth, max_info_columns, max_info_rows,
         max_rows, max_seq_items, mpl_style, multi_sparse, notebook_repr_html,
         pprint_nest_depth, precision, width]
mode.[sim_interactive, use_inf_as_null]

参数
----------
pat - str/regexp,应与单个选项匹配。

注意:为方便起见,支持部分匹配,但除非您使用
完整的选项名称(例如x.y.z.option_name),否则如果将来的版本
引入了名称相似的新选项,您的代码可能会中断。

value - 选项的新值。

返回值
-------
无

引发
------
如果没有这样的选项,则为KeyError

display.chop_threshold:[默认:无] [当前:无]
:浮动或无
        如果设置为浮点值,则所有浮点值均小于给定阈值
        将由repr和朋友显示为正好为0。
display.colheader_justify:[默认:正确] [当前:正确]
: '左右'
        控制列标题的对正。由DataFrameFormatter使用。
display.column_space:[默认值:12] [当前:12]无可用描述。

display.date_dayfirst:[默认:False] [当前:False]
:布尔值
        如果为True,则打印和解析日期的日期为第一天,例如20/01/2005
display.date_yearfirst:[默认:False] [当前:False]
:布尔值
        如果为True,则打印和解析日期以年份为第一,例如2005/01/20
display.encoding:[默认:UTF-8] [当前:UTF-8]
:str / unicode
        默认为检测到的控制台编码。
        指定用于to_string返回的字符串的编码,
        这些通常是要在控制台上显示的字符串。
display.expand_frame_repr:[默认:True] [当前:True]
:布尔值
        是否为宽数据框打印完整的数据框代表
        跨多行,`max_columns`仍然受到尊重,但是输出将
        如果宽度超过“ display.width”,则跨多个“页面”进行环绕。
display.float_format:[默认:无] [当前:无]
:可调用
        可调用对象应接受浮点数并返回
        具有所需数字格式的字符串。这用
        在某些地方,例如SeriesFormatter。
        有关示例,请参见core.format.EngFormatter。
display.height:[默认值:60] [当前:1000]
:int
        不推荐使用。
        (已弃用,请改用display.height。)

display.line_width:[默认值:80] [当前:1000]
:int
        不推荐使用。
        (已弃用,请改用display.width。)

display.max_columns:[默认:20] [当前:500]
:int
        在__repr __()方法中使用max_rows和max_columns来确定是否
        to_string()或info()用于将对象呈现为字符串。如果
        python / IPython在终端中运行,可以将其设置为0和pandas
        将正确地自动检测终端的宽度并交换为较小的宽度
        格式,以防所有列都不能垂直放置。IPython笔记本,
        IPython qtconsole或IDLE不在终端中运行,因此它不是
        可以进行正确的自动检测。
        “无”值意味着无限。
display.max_colwidth:[默认:50] [当前:50]
:int
        列的最大宽度(以字符为单位)
        大熊猫数据结构。当列溢出时,会出现一个“ ...”
        占位符嵌入在输出中。
display.max_info_columns:[默认:100] [当前:100]
:int
        在DataFrame.info方法中使用max_info_columns来确定是否
        每列信息将被打印。
display.max_info_rows:[默认:1690785] [当前:1690785]
:int或无
        max_info_rows是一帧将要进行的最大行数
        重新进入控制台时,对其列执行null检查。
        默认值为1,000,000行。因此,如果DataFrame具有更多
        1,000,000行将不会对
        列,因此表示将花费更少的时间
        在互动会话中显示。值None表示总是
        重复时执行空检查。
display.max_rows:[默认:60] [当前:500]
:int
        设置打印时熊猫应输出的最大行数
        各种输出。例如,此值确定是否repr()
        数据框完全打印出来或只是摘要表示。
        “无”值意味着无限。
display.max_seq_items:[默认:无] [当前:无]
:int或无

        漂亮地打印长序列时,不超过`max_seq_items`
        将被打印。如果省略项目,将用加法表示
        “ ...”到结果字符串。

        如果设置为“无”,则要打印的项目数不受限制。
display.mpl_style:[默认:无] [当前:无]
:布尔

        将此设置为“默认”将修改matplotlib使用的rcParams
        默认情况下,为绘图提供更令人愉悦的视觉样式。
        将此设置为None / False会将值恢复为其初始值。
display.multi_sparse:[默认:True] [当前:True]
:布尔值
        “ sparsify” MultiIndex显示(不重复显示
        组内外层的元素)
display.notebook_repr_html:[默认:True] [当前:True]
:布尔值
        如果为True,则IPython Notebook将使用html表示形式
        熊猫对象(如果有)。
display.pprint_nest_depth:[默认值:3] [当前:3]
:int
        控制漂亮打印时要处理的嵌套层数
display.precision:[默认:7] [当前:7]
:int
        浮点输出精度(有效位数)。这是
        只是一个建议
display.width:[默认值:80] [当前:1000]
:int
        显示的宽度(以字符为单位)。如果python / IPython运行在
        可以将其设置为“无”的终端,熊猫会正确自动检测
        宽度。
        请注意,IPython笔记本,IPython qtconsole或IDLE不会在
        终端,因此无法正确检测宽度。
mode.sim_interactive:[默认:False] [当前:False]
:布尔值
        是否为了测试目的而模拟交互模式
mode.use_inf_as_null:[默认:False] [当前:False]
:布尔值
        True表示将None,NaN,INF,-INF视为null(旧方法),
        False表示None和NaN为空,但INF,-INF不为空
        (新方法)。
调用签名:pd.set_option(self, *args, **kwds)

编辑:以下是较旧版本的信息,其中许多内容已被弃用。

如@bmu所述,pandas会(默认情况下)自动检测显示区域的大小,当对象的表示形式放不下时,将使用摘要视图。您提到了调整IDLE窗口的大小,但没有任何效果。如果执行print df.describe().to_string(),它能放进IDLE窗口吗?

终端大小由pandas.util.terminal.get_terminal_size()确定(已弃用并移除),它返回一个表示显示大小的(width, height)元组。输出是否与您的IDLE窗口的大小匹配?这里可能存在问题(之前在emacs中运行终端时就有一个问题)。

请注意,可以使用pandas.set_printoptions(max_rows=200, max_columns=10)绕过自动检测:如果行数、列数不超过给定的限制,就永远不会切换到摘要视图。


“ max_colwidth”选项有助于查看每列的截断形式。

截断的列显示

Update: Pandas 0.23.4 onwards

This is not necessary, pandas autodetects the size of your terminal window if you set pd.options.display.width = 0. (For older versions see at bottom.)

pandas.set_printoptions(...) is deprecated. Instead, use pandas.set_option(optname, val), or equivalently pd.options.<opt.hierarchical.name> = val. Like:

import pandas as pd
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)
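
For a one-off wide display (say, a single describe() call), pd.option_context applies the settings only for the duration of the with block:

with pd.option_context('display.max_columns', None, 'display.width', 1000):
    print(df.describe())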

Here is the help for set_option:

set_option(pat,value) - Sets the value of the specified option

Available options:
display.[chop_threshold, colheader_justify, column_space, date_dayfirst,
         date_yearfirst, encoding, expand_frame_repr, float_format, height,
         line_width, max_columns, max_colwidth, max_info_columns, max_info_rows,
         max_rows, max_seq_items, mpl_style, multi_sparse, notebook_repr_html,
         pprint_nest_depth, precision, width]
mode.[sim_interactive, use_inf_as_null]

Parameters
----------
pat - str/regexp which should match a single option.

Note: partial matches are supported for convenience, but unless you use the
full option name (e.g. x.y.z.option_name), your code may break in future
versions if new options with similar names are introduced.

value - new value of option.

Returns
-------
None

Raises
------
KeyError if no such option exists

display.chop_threshold: [default: None] [currently: None]
: float or None
        if set to a float value, all float values smaller then the given threshold
        will be displayed as exactly 0 by repr and friends.
display.colheader_justify: [default: right] [currently: right]
: 'left'/'right'
        Controls the justification of column headers. used by DataFrameFormatter.
display.column_space: [default: 12] [currently: 12]No description available.

display.date_dayfirst: [default: False] [currently: False]
: boolean
        When True, prints and parses dates with the day first, eg 20/01/2005
display.date_yearfirst: [default: False] [currently: False]
: boolean
        When True, prints and parses dates with the year first, eg 2005/01/20
display.encoding: [default: UTF-8] [currently: UTF-8]
: str/unicode
        Defaults to the detected encoding of the console.
        Specifies the encoding to be used for strings returned by to_string,
        these are generally strings meant to be displayed on the console.
display.expand_frame_repr: [default: True] [currently: True]
: boolean
        Whether to print out the full DataFrame repr for wide DataFrames
        across multiple lines, `max_columns` is still respected, but the output will
        wrap-around across multiple "pages" if it's width exceeds `display.width`.
display.float_format: [default: None] [currently: None]
: callable
        The callable should accept a floating point number and return
        a string with the desired format of the number. This is used
        in some places like SeriesFormatter.
        See core.format.EngFormatter for an example.
display.height: [default: 60] [currently: 1000]
: int
        Deprecated.
        (Deprecated, use `display.height` instead.)

display.line_width: [default: 80] [currently: 1000]
: int
        Deprecated.
        (Deprecated, use `display.width` instead.)

display.max_columns: [default: 20] [currently: 500]
: int
        max_rows and max_columns are used in __repr__() methods to decide if
        to_string() or info() is used to render an object to a string.  In case
        python/IPython is running in a terminal this can be set to 0 and pandas
        will correctly auto-detect the width of the terminal and swap to a smaller
        format in case all columns would not fit vertically. The IPython notebook,
        IPython qtconsole, or IDLE do not run in a terminal and hence it is not
        possible to do correct auto-detection.
        'None' value means unlimited.
display.max_colwidth: [default: 50] [currently: 50]
: int
        The maximum width in characters of a column in the repr of
        a pandas data structure. When the column overflows, a "..."
        placeholder is embedded in the output.
display.max_info_columns: [default: 100] [currently: 100]
: int
        max_info_columns is used in DataFrame.info method to decide if
        per column information will be printed.
display.max_info_rows: [default: 1690785] [currently: 1690785]
: int or None
        max_info_rows is the maximum number of rows for which a frame will
        perform a null check on its columns when repr'ing to a console.
        The default is 1,000,000 rows. So, if a DataFrame has more than
        1,000,000 rows there will be no null check performed on the
        columns and thus the representation will take much less time to
        display in an interactive session. A value of None means always
        perform a null check when repr'ing.
display.max_rows: [default: 60] [currently: 500]
: int
        This sets the maximum number of rows pandas should output when printing
        out various output. For example, this value determines whether the repr()
        for a dataframe prints out fully or just a summary repr.
        'None' value means unlimited.
display.max_seq_items: [default: None] [currently: None]
: int or None

        when pretty-printing a long sequence, no more than `max_seq_items`
        will be printed. If items are omitted, they will be denoted by the addition
        of "..." to the resulting string.

        If set to None, the number of items to be printed is unlimited.
display.mpl_style: [default: None] [currently: None]
: bool

        Setting this to 'default' will modify the rcParams used by matplotlib
        to give plots a more pleasing visual style by default.
        Setting this to None/False restores the values to their initial value.
display.multi_sparse: [default: True] [currently: True]
: boolean
        "sparsify" MultiIndex display (don't display repeated
        elements in outer levels within groups)
display.notebook_repr_html: [default: True] [currently: True]
: boolean
        When True, IPython notebook will use html representation for
        pandas objects (if it is available).
display.pprint_nest_depth: [default: 3] [currently: 3]
: int
        Controls the number of nested levels to process when pretty-printing
display.precision: [default: 7] [currently: 7]
: int
        Floating point output precision (number of significant digits). This is
        only a suggestion
display.width: [default: 80] [currently: 1000]
: int
        Width of the display in characters. In case python/IPython is running in
        a terminal this can be set to None and pandas will correctly auto-detect the
        width.
        Note that the IPython notebook, IPython qtconsole, or IDLE do not run in a
        terminal and hence it is not possible to correctly detect the width.
mode.sim_interactive: [default: False] [currently: False]
: boolean
        Whether to simulate interactive mode for purposes of testing
mode.use_inf_as_null: [default: False] [currently: False]
: boolean
        True means treat None, NaN, INF, -INF as null (old way),
        False means None and NaN are null, but INF, -INF are not null
        (new way).
Call def:   pd.set_option(self, *args, **kwds)
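
The same per-option help can also be printed programmatically. A minimal sketch, assuming a pandas version that ships pandas.describe_option (current releases do):

import pandas as pd

pd.describe_option('display.width')  # print the documentation for a single option
pd.describe_option('display.max')    # partial match: print docs for all matching options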

EDIT: older version information, much of this has been deprecated.

As @bmu mentioned, pandas auto-detects (by default) the size of the display area; a summary view will be used when an object repr does not fit on the display. You mentioned resizing the IDLE window, to no effect. If you do print(df.describe().to_string()), does it fit in the IDLE window?

The terminal size is determined by pandas.util.terminal.get_terminal_size() (now deprecated and removed), which returns a tuple containing the (width, height) of the display. Does the output match the size of your IDLE window? There might be an issue (there was one before when running a terminal in emacs).

Note that it is possible to bypass the autodetect: pandas.set_printoptions(max_rows=200, max_columns=10) will never switch to summary view if the number of rows and columns does not exceed the given limits.


The 'max_colwidth' option helps in seeing the untruncated form of each column.

[Screenshot: truncated column display]
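
A minimal sketch of that option (the column name and string below are made up for illustration; pandas 1.0+ uses None for "no limit", older versions used -1):

import pandas as pd

df = pd.DataFrame({'text': ['a fairly long string that the default repr would cut off with an ellipsis']})

pd.set_option('display.max_colwidth', 20)    # long cells are truncated with "..."
print(df)

pd.set_option('display.max_colwidth', None)  # no limit: show every cell in full
print(df)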


Answer 1

Try this:

pd.set_option('display.expand_frame_repr', False)

From the documentation:

display.expand_frame_repr : boolean

Whether to print out the full DataFrame repr for wide DataFrames across multiple lines, max_columns is still respected, but the output will wrap-around across multiple “pages” if its width exceeds display.width. [default: True] [currently: True]

See: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.set_option.html
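
A small sketch of the difference; the 16-column frame is made up purely to force wrapping on a default 80-character display:

import pandas as pd

df = pd.DataFrame({'col%d' % i: range(3) for i in range(16)})

pd.set_option('display.expand_frame_repr', True)   # default: wide repr wraps across "pages"
print(df)

pd.set_option('display.expand_frame_repr', False)  # one long line per row instead
print(df)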


Answer 2

If you want to set options temporarily to display one large DataFrame, you can use option_context:

with pd.option_context('display.max_rows', None, 'display.max_columns', None):
    print (df)

Option values are restored automatically when you exit the with block.
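
A short sketch showing that the override only lives inside the block (the 100-row frame is made up for illustration):

import pandas as pd

df = pd.DataFrame({'a': range(100)})

print(pd.options.display.max_rows)        # e.g. 60 by default
with pd.option_context('display.max_rows', None):
    print(df)                             # the full 100 rows are printed here
print(pd.options.display.max_rows)        # restored to the previous value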


Answer 3

Only using these 3 lines worked for me:

pd.set_option('display.max_columns', None)  
pd.set_option('display.expand_frame_repr', False)
pd.set_option('max_colwidth', -1)

Anaconda / Python 3.6.5 / pandas: 0.23.0 / Visual Studio Code 1.26
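
Note that on pandas 1.0 and later, passing -1 to max_colwidth is deprecated; assuming a recent release, the equivalent would be:

import pandas as pd

pd.set_option('display.max_columns', None)
pd.set_option('display.expand_frame_repr', False)
pd.set_option('display.max_colwidth', None)  # None replaces the deprecated -1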


Answer 4

Set column max width using:

pd.set_option('max_colwidth', 800)

This particular statement sets the maximum width of each column to 800 characters (the option is measured in characters, not pixels).


Answer 5

You can use print(df.describe().to_string()) to force it to show the whole table. (You can use to_string() like this for any DataFrame. The result of describe is just a DataFrame itself.)

The 8 is the number of rows in the DataFrame holding the “description” (because describe computes 8 statistics: count, mean, std, min, 25%, 50%, 75%, and max).
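
A minimal sketch (the random 100x7 frame is made up for illustration):

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(100, 7), columns=['x%d' % i for i in range(1, 8)])
print(df.describe().to_string())  # always renders the full table, never the summary view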


Answer 6

You can adjust pandas print options with set_printoptions.

In [3]: df.describe()
Out[3]: 
<class 'pandas.core.frame.DataFrame'>
Index: 8 entries, count to max
Data columns:
x1    8  non-null values
x2    8  non-null values
x3    8  non-null values
x4    8  non-null values
x5    8  non-null values
x6    8  non-null values
x7    8  non-null values
dtypes: float64(7)

In [4]: pd.set_printoptions(precision=2)

In [5]: df.describe()
Out[5]: 
            x1       x2       x3       x4       x5       x6       x7
count      8.0      8.0      8.0      8.0      8.0      8.0      8.0
mean   69024.5  69025.5  69026.5  69027.5  69028.5  69029.5  69030.5
std       17.1     17.1     17.1     17.1     17.1     17.1     17.1
min    69000.0  69001.0  69002.0  69003.0  69004.0  69005.0  69006.0
25%    69012.2  69013.2  69014.2  69015.2  69016.2  69017.2  69018.2
50%    69024.5  69025.5  69026.5  69027.5  69028.5  69029.5  69030.5
75%    69036.8  69037.8  69038.8  69039.8  69040.8  69041.8  69042.8
max    69049.0  69050.0  69051.0  69052.0  69053.0  69054.0  69055.0

However, this will not work in all cases, as pandas detects your console width and will only use to_string if the output fits in the console (see the docstring of set_printoptions). In this case you can explicitly call to_string as answered by BrenBarn.

Update

With version 0.10 the way wide dataframes are printed changed:

In [3]: df.describe()
Out[3]: 
                 x1            x2            x3            x4            x5  \
count      8.000000      8.000000      8.000000      8.000000      8.000000   
mean   59832.361578  27356.711336  49317.281222  51214.837838  51254.839690   
std    22600.723536  26867.192716  28071.737509  21012.422793  33831.515761   
min    31906.695474   1648.359160     56.378115  16278.322271     43.745574   
25%    45264.625201  12799.540572  41429.628749  40374.273582  29789.643875   
50%    56340.214856  18666.456293  51995.661512  54894.562656  47667.684422   
75%    75587.003417  31375.610322  61069.190523  67811.893435  76014.884048   
max    98136.474782  84544.484627  91743.983895  75154.587156  99012.695717   

                 x6            x7  
count      8.000000      8.000000  
mean   41863.000717  33950.235126  
std    38709.468281  29075.745673  
min     3590.990740   1833.464154  
25%    15145.759625   6879.523949  
50%    22139.243042  33706.029946  
75%    72038.983496  51449.893980  
max    98601.190488  83309.051963  

Furthermore, the API for setting pandas options changed:

In [4]: pd.set_option('display.precision', 2)

In [5]: df.describe()
Out[5]: 
            x1       x2       x3       x4       x5       x6       x7
count      8.0      8.0      8.0      8.0      8.0      8.0      8.0
mean   59832.4  27356.7  49317.3  51214.8  51254.8  41863.0  33950.2
std    22600.7  26867.2  28071.7  21012.4  33831.5  38709.5  29075.7
min    31906.7   1648.4     56.4  16278.3     43.7   3591.0   1833.5
25%    45264.6  12799.5  41429.6  40374.3  29789.6  15145.8   6879.5
50%    56340.2  18666.5  51995.7  54894.6  47667.7  22139.2  33706.0
75%    75587.0  31375.6  61069.2  67811.9  76014.9  72039.0  51449.9
max    98136.5  84544.5  91744.0  75154.6  99012.7  98601.2  83309.1
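
On current pandas releases set_printoptions is gone entirely; assuming a recent version, the precision example above becomes:

import pandas as pd

pd.set_option('display.precision', 2)  # replaces pd.set_printoptions(precision=2)
pd.options.display.precision = 2       # equivalent attribute-style spelling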

Answer 7

You can set the output display to match your current terminal width:

pd.set_option('display.width', pd.util.terminal.get_terminal_size()[0])
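
Since pandas.util.terminal.get_terminal_size() was deprecated and removed (see above), a sketch of the same idea on Python 3 using the standard library:

import shutil

import pandas as pd

width, height = shutil.get_terminal_size()  # (columns, lines) of the current terminal
pd.set_option('display.width', width)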

Answer 8

According to the docs for v0.18.0, if you’re running in a terminal (i.e. not an IPython notebook, qtconsole, or IDLE), it’s a 2-liner to have pandas auto-detect your screen width and adapt on the fly how many columns it shows:

pd.set_option('display.large_repr', 'truncate')
pd.set_option('display.max_columns', 0)

Answer 9

It seems like all the above answers solve the problem. One more point: instead of pd.set_option('option_name', value), you can use the (auto-completable)

pd.options.display.width = None

See the pandas documentation, Options and Settings:

Options have a full “dotted-style”, case-insensitive name (e.g. display.max_rows). You can get/set options directly as attributes of the top-level options attribute:

In [1]: import pandas as pd

In [2]: pd.options.display.max_rows
Out[2]: 15

In [3]: pd.options.display.max_rows = 999

In [4]: pd.options.display.max_rows
Out[4]: 999

[…]

for the max_... params:

max_rows and max_columns are used in __repr__() methods to decide if to_string() or info() is used to render an object to a string. In case python/IPython is running in a terminal this can be set to 0 and pandas will correctly auto-detect the width of the terminal and swap to a smaller format in case all columns would not fit vertically. The IPython notebook, IPython qtconsole, or IDLE do not run in a terminal and hence it is not possible to do correct auto-detection. ‘None’ value means unlimited. [emphasis not in original]

for the width param:

Width of the display in characters. In case python/IPython is running in a terminal this can be set to None and pandas will correctly auto-detect the width. Note that the IPython notebook, IPython qtconsole, or IDLE do not run in a terminal and hence it is not possible to correctly detect the width.


Answer 10

import pandas as pd
pd.set_option('display.max_columns', 100)
pd.set_option('display.width', 1000)

SentenceA = "William likes Piano and Piano likes William"
SentenceB = "Sara likes Guitar"
SentenceC = "Mamoosh likes Piano"
SentenceD = "William is a CS Student"
SentenceE = "Sara is kind"
SentenceF = "Mamoosh is kind"


bowA = SentenceA.split(" ")
bowB = SentenceB.split(" ")
bowC = SentenceC.split(" ")
bowD = SentenceD.split(" ")
bowE = SentenceE.split(" ")
bowF = SentenceF.split(" ")

# Creating a set consisting of all the words

wordSet = set(bowA).union(set(bowB)).union(set(bowC)).union(set(bowD)).union(set(bowE)).union(set(bowF))
print("Set of all words is: ", wordSet)

# Initializing each bag-of-words dictionary with a value of 0 for every word

wordDictA = dict.fromkeys(wordSet, 0)
wordDictB = dict.fromkeys(wordSet, 0)
wordDictC = dict.fromkeys(wordSet, 0)
wordDictD = dict.fromkeys(wordSet, 0)
wordDictE = dict.fromkeys(wordSet, 0)
wordDictF = dict.fromkeys(wordSet, 0)

for word in bowA:
    wordDictA[word] += 1
for word in bowB:
    wordDictB[word] += 1
for word in bowC:
    wordDictC[word] += 1
for word in bowD:
    wordDictD[word] += 1
for word in bowE:
    wordDictE[word] += 1
for word in bowF:
    wordDictF[word] += 1

# Printing Term frequency

print("SentenceA TF: ", wordDictA)
print("SentenceB TF: ", wordDictB)
print("SentenceC TF: ", wordDictC)
print("SentenceD TF: ", wordDictD)
print("SentenceE TF: ", wordDictE)
print("SentenceF TF: ", wordDictF)

print(pd.DataFrame([wordDictA, wordDictB, wordDictC, wordDictD, wordDictE, wordDictF]))  # one row per sentence

Output:

   CS  Guitar  Mamoosh  Piano  Sara  Student  William  a  and  is  kind  likes
0   0       0        0      2     0        0        2  0    1   0     0      2
1   0       1        0      0     1        0        0  0    0   0     0      1
2   0       0        1      1     0        0        0  0    0   0     0      1
3   1       0        0      0     0        1        1  1    0   1     0      0
4   0       0        0      0     1        0        0  0    0   1     1      0
5   0       0        1      0     0        0        0  0    0   1     1      0
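
The same term-frequency table can be built far more compactly; a sketch using collections.Counter on the sentences defined above:

from collections import Counter

import pandas as pd

sentences = [SentenceA, SentenceB, SentenceC, SentenceD, SentenceE, SentenceF]  # from the snippet above
tf = pd.DataFrame([Counter(s.split(" ")) for s in sentences]).fillna(0).astype(int)
print(tf)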

Answer 11

I used these settings when the scale of the data was high.

# environment settings:
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
pd.set_option('display.max_seq_items', None)
pd.set_option('display.max_colwidth', 500)
pd.set_option('display.expand_frame_repr', True)

You can refer to the documentation here.


Answer 12

The line below is enough to display all columns of a dataframe: pd.set_option('display.max_columns', None)


Answer 13

If you don’t want to mess with your display options and you just want to see this one particular list of columns without expanding every dataframe you view, you could try:

df.columns.values
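
For example, with a made-up three-column frame:

import pandas as pd

df = pd.DataFrame({'a': [1], 'b': [2], 'c': [3]})
print(df.columns.values)  # ['a' 'b' 'c'] as a numpy array
print(list(df.columns))   # or as a plain Python list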

Answer 14

You can also try in a loop:

for col in df.columns: 
    print(col) 

Answer 15

You can simply do the following steps:

  • You can change the option for the pandas max_columns feature as follows:

    import pandas as pd
    pd.options.display.max_columns = 10

    (this allows 10 columns to be displayed; you can change this as you need)

  • Likewise, you can change the number of rows displayed as follows (if you need to change the maximum rows as well):

    pd.options.display.max_rows = 999

    (this allows 999 rows to be printed at a time)

Please refer to the documentation to change the different pandas options/settings.