Tag archive: pandas

Convert a pandas MultiIndex into columns

Question: Convert a pandas MultiIndex into columns

我有一个具有2个索引级别的数据框:

                         value
Trial    measurement
    1              0        13
                   1         3
                   2         4
    2              0       NaN
                   1        12
    3              0        34 

我想变成这样:

Trial    measurement       value

    1              0        13
    1              1         3
    1              2         4
    2              0       NaN
    2              1        12
    3              0        34 

我怎样才能最好地做到这一点?

我需要这样做是因为我想按照此处的指示汇总数据,但是如果将它们用作索引,则无法选择这样的列。

I have a dataframe with 2 index levels:

                         value
Trial    measurement
    1              0        13
                   1         3
                   2         4
    2              0       NaN
                   1        12
    3              0        34 

Which I want to turn into this:

Trial    measurement       value

    1              0        13
    1              1         3
    1              2         4
    2              0       NaN
    2              1        12
    3              0        34 

How can I best do this?

I need this because I want to aggregate the data as instructed here, but I can’t select my columns like that if they are in use as indices.


Answer 0

The reset_index() is a pandas DataFrame method that will transfer index values into the DataFrame as columns. The default setting for the parameter is drop=False (which will keep the index values as columns).

All you have to do add .reset_index(inplace=True) after the name of the DataFrame:

df.reset_index(inplace=True)  
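
For illustration, here is a minimal sketch that rebuilds the example frame from the question and applies reset_index() (the level names Trial and measurement are taken from the question):

import pandas as pd
import numpy as np

# Rebuild the two-level-index frame from the question.
index = pd.MultiIndex.from_tuples(
    [(1, 0), (1, 1), (1, 2), (2, 0), (2, 1), (3, 0)],
    names=['Trial', 'measurement'])
df = pd.DataFrame({'value': [13, 3, 4, np.nan, 12, 34]}, index=index)

df = df.reset_index()          # 'Trial' and 'measurement' become ordinary columns
print(df.columns.tolist())     # ['Trial', 'measurement', 'value']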

Answer 1

This doesn’t really apply to your case but could be helpful for others (like myself 5 minutes ago) to know. If one’s multiindex levels have the same name like this:

                         value
Trial        Trial
    1              0        13
                   1         3
                   2         4
    2              0       NaN
                   1        12
    3              0        34 

df.reset_index(inplace=True) will fail, because the columns that are created cannot have the same names.

So then you need to rename the multiindex with df.index = df.index.set_names(['Trial', 'measurement']) to get:

                           value
Trial    measurement       

    1              0        13
    1              1         3
    1              2         4
    2              0       NaN
    2              1        12
    3              0        34 

And then df.reset_index(inplace=True) will work like a charm.

I encountered this problem after grouping by year and month on a datetime-column(not index) called live_date, which meant that both year and month were named live_date.
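
A minimal sketch of the fix described above (the duplicated level name is constructed deliberately here for illustration; whether construction with duplicate level names succeeds may depend on the pandas version):

import pandas as pd

# Two index levels that accidentally share the name 'Trial'.
index = pd.MultiIndex.from_tuples([(1, 0), (1, 1), (2, 0)],
                                  names=['Trial', 'Trial'])
df = pd.DataFrame({'value': [13, 3, 4]}, index=index)

# Give the levels distinct names first, then reset.
df.index = df.index.set_names(['Trial', 'measurement'])
df.reset_index(inplace=True)
print(df.columns.tolist())     # ['Trial', 'measurement', 'value']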


Answer 2

As @cs95 mentioned in a comment, to drop only one level, use:

df.reset_index(level=[...])

This avoids having to redefine your desired index after reset.
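
For example, a small self-contained sketch using the question's level names: resetting only the inner level keeps Trial as the index:

import pandas as pd

index = pd.MultiIndex.from_tuples([(1, 0), (1, 1), (2, 0)],
                                  names=['Trial', 'measurement'])
df = pd.DataFrame({'value': [13, 3, 4]}, index=index)

out = df.reset_index(level='measurement')   # only 'measurement' moves out to a column
print(out.columns.tolist())   # ['measurement', 'value']
print(out.index.name)         # 'Trial'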


Find the index of an element in a pandas Series

Question: Find the index of an element in a pandas Series

I know this is a very basic question but for some reason I can’t find an answer. How can I get the index of certain element of a Series in python pandas? (first occurrence would suffice)

I.e., I’d like something like:

import pandas as pd
myseries = pd.Series([1,4,0,7,5], index=[0,1,2,3,4])
print myseries.find(7) # should output 3

Certainly, it is possible to define such a method with a loop:

def find(s, el):
    for i in s.index:
        if s[i] == el: 
            return i
    return None

print find(myseries, 7)

but I assume there should be a better way. Is there?


Answer 0

>>> myseries[myseries == 7]
3    7
dtype: int64
>>> myseries[myseries == 7].index[0]
3

Though I admit that there should be a better way to do that, this at least avoids iterating and looping through the object and moves it to the C level.


Answer 1

Converting to an Index, you can use get_loc

In [1]: myseries = pd.Series([1,4,0,7,5], index=[0,1,2,3,4])

In [3]: Index(myseries).get_loc(7)
Out[3]: 3

In [4]: Index(myseries).get_loc(10)
KeyError: 10

Duplicate handling

In [5]: Index([1,1,2,2,3,4]).get_loc(2)
Out[5]: slice(2, 4, None)

Will return a boolean array if the matches are non-contiguous

In [6]: Index([1,1,2,1,3,2,4]).get_loc(2)
Out[6]: array([False, False,  True, False, False,  True, False], dtype=bool)

Uses a hashtable internally, so fast

In [7]: s = Series(randint(0,10,10000))

In [9]: %timeit s[s == 5]
1000 loops, best of 3: 203 µs per loop

In [12]: i = Index(s)

In [13]: %timeit i.get_loc(5)
1000 loops, best of 3: 226 µs per loop

As Viktor points out, there is a one-time creation overhead to creating an index (it’s incurred when you actually DO something with the index, e.g. the is_unique check)

In [2]: s = Series(randint(0,10,10000))

In [3]: %timeit Index(s)
100000 loops, best of 3: 9.6 µs per loop

In [4]: %timeit Index(s).is_unique
10000 loops, best of 3: 140 µs per loop

Answer 2

In [92]: (myseries==7).argmax()
Out[92]: 3

This works if you know 7 is there in advance. You can check this with (myseries==7).any()

Another approach (very similar to the first answer) that also accounts for multiple 7’s (or none) is

In [122]: myseries = pd.Series([1,7,0,7,5], index=['a','b','c','d','e'])
In [123]: list(myseries[myseries==7].index)
Out[123]: ['b', 'd']
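
A small self-contained sketch of the .any() guard mentioned above, so that a missing value does not silently give position 0:

import pandas as pd

myseries = pd.Series([1, 4, 0, 7, 5])
mask = myseries == 7
pos = mask.argmax() if mask.any() else None   # without the guard, argmax on an all-False mask returns 0
print(pos)   # 3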

Answer 3

I’m impressed with all the answers here. This is not a new answer, just an attempt to summarize the timings of all these methods. I considered the case of a series with 25 elements and assumed the general case where the index could contain any values and you want the index value corresponding to the search value which is towards the end of the series.

Here are the speed tests on a 2013 MacBook Pro in Python 3.7 with Pandas version 0.25.3.

In [1]: import pandas as pd                                                

In [2]: import numpy as np                                                 

In [3]: data = [406400, 203200, 101600,  76100,  50800,  25400,  19050,  12700, 
   ...:          9500,   6700,   4750,   3350,   2360,   1700,   1180,    850, 
   ...:           600,    425,    300,    212,    150,    106,     75,     53, 
   ...:            38]                                                                               

In [4]: myseries = pd.Series(data, index=range(1,26))                                                

In [5]: myseries[21]                                                                                 
Out[5]: 150

In [7]: %timeit myseries[myseries == 150].index[0]                                                   
416 µs ± 5.05 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [8]: %timeit myseries[myseries == 150].first_valid_index()                                        
585 µs ± 32.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [9]: %timeit myseries.where(myseries == 150).first_valid_index()                                  
652 µs ± 23.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [10]: %timeit myseries.index[np.where(myseries == 150)[0][0]]                                     
195 µs ± 1.18 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [11]: %timeit pd.Series(myseries.index, index=myseries)[150]                 
178 µs ± 9.35 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [12]: %timeit myseries.index[pd.Index(myseries).get_loc(150)]                                    
77.4 µs ± 1.41 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [13]: %timeit myseries.index[list(myseries).index(150)]
12.7 µs ± 42.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

In [14]: %timeit myseries.index[myseries.tolist().index(150)]                   
9.46 µs ± 19.2 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

@Jeff’s answer seems to be the fastest – although it doesn’t handle duplicates.

Correction: Sorry, I missed one, @Alex Spangher’s solution using the list index method is by far the fastest.

Update: Added @EliadL’s answer.

Hope this helps.

Amazing that such a simple operation requires such convoluted solutions and many are so slow. Over half a millisecond in some cases to find a value in a series of 25.


Answer 4

Another way to do this, although equally unsatisfying is:

s = pd.Series([1,3,0,7,5],index=[0,1,2,3,4])

list(s).index(7)

returns: 3

Some timing tests using a current dataset I’m working with (consider it random):

[64]:    %timeit pd.Index(article_reference_df.asset_id).get_loc('100000003003614')
10000 loops, best of 3: 60.1 µs per loop

In [66]: %timeit article_reference_df.asset_id[article_reference_df.asset_id == '100000003003614'].index[0]
1000 loops, best of 3: 255 µs per loop


In [65]: %timeit list(article_reference_df.asset_id).index('100000003003614')
100000 loops, best of 3: 14.5 µs per loop

Answer 5

If you use numpy, you can get an array of the indices where your value is found:

import numpy as np
import pandas as pd
myseries = pd.Series([1,4,0,7,5], index=[0,1,2,3,4])
np.where(myseries == 7)

This returns a one-element tuple containing an array of the indices where 7 is the value in myseries:

(array([3], dtype=int64),)
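
To turn that positional result back into an index label, a short self-contained sketch:

import numpy as np
import pandas as pd

myseries = pd.Series([1, 4, 0, 7, 5], index=[0, 1, 2, 3, 4])
pos = np.where(myseries == 7)[0]                       # positional locations of the matches
label = myseries.index[pos[0]] if len(pos) else None   # map back to the index label, None if absent
print(label)   # 3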

Answer 6

You can use Series.idxmax() (note that it returns the index of the maximum value, so it only finds 7 here because 7 happens to be the largest element):

>>> import pandas as pd
>>> myseries = pd.Series([1,4,0,7,5], index=[0,1,2,3,4])
>>> myseries.idxmax()
3
>>> 

Answer 7

Another way to do it that hasn’t been mentioned yet is the tolist method:

myseries.tolist().index(7)

should return the correct index, assuming the value exists in the Series.
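
A small self-contained sketch, including the mapping from list position back to index label and a guard for the missing-value case (list.index raises ValueError when the value is absent):

import pandas as pd

myseries = pd.Series([1, 4, 0, 7, 5], index=[0, 1, 2, 3, 4])
try:
    pos = myseries.tolist().index(7)   # positional location, 3 here
    label = myseries.index[pos]        # map the position back to the index label
except ValueError:
    label = None                       # value not present in the Series
print(label)   # 3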


Answer 8

Often your value occurs at multiple indices:

>>> myseries = pd.Series([0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1])
>>> myseries.index[myseries == 1]
Int64Index([3, 4, 5, 6, 10, 11], dtype='int64')

Answer 9

This is the most native and scalable approach I could find:

>>> myindex = pd.Series(myseries.index, index=myseries)

>>> myindex[7]
3

>>> myindex[[7, 5, 7]]
7    3
5    4
7    3
dtype: int64

How to access a pandas groupby DataFrame by key

Question: How to access a pandas groupby DataFrame by key

How do I access the corresponding groupby dataframe in a groupby object by the key?

With the following groupby:

rand = np.random.RandomState(1)
df = pd.DataFrame({'A': ['foo', 'bar'] * 3,
                   'B': rand.randn(6),
                   'C': rand.randint(0, 20, 6)})
gb = df.groupby(['A'])

I can iterate through it to get the keys and groups:

In [11]: for k, gp in gb:
             print 'key=' + str(k)
             print gp
key=bar
     A         B   C
1  bar -0.611756  18
3  bar -1.072969  10
5  bar -2.301539  18
key=foo
     A         B   C
0  foo  1.624345   5
2  foo -0.528172  11
4  foo  0.865408  14

I would like to be able to access a group by its key:

In [12]: gb['foo']
Out[12]:  
     A         B   C
0  foo  1.624345   5
2  foo -0.528172  11
4  foo  0.865408  14

But when I try doing that with gb[('foo',)] I get this weird pandas.core.groupby.DataFrameGroupBy object thing which doesn’t seem to have any methods that correspond to the DataFrame I want.

The best I could think of is:

In [13]: def gb_df_key(gb, key, orig_df):
             ix = gb.indices[key]
             return orig_df.ix[ix]

         gb_df_key(gb, 'foo', df)
Out[13]:
     A         B   C
0  foo  1.624345   5
2  foo -0.528172  11
4  foo  0.865408  14  

but this is kind of nasty, considering how nice pandas usually is at these things.
What’s the built-in way of doing this?


Answer 0

You can use the get_group method:

In [21]: gb.get_group('foo')
Out[21]: 
     A         B   C
0  foo  1.624345   5
2  foo -0.528172  11
4  foo  0.865408  14

Note: This doesn’t require creating an intermediary dictionary / copy of every subdataframe for every group, so it will be much more memory-efficient than creating the naive dictionary with dict(iter(gb)). This is because it uses data-structures already available in the groupby object.


You can select different columns using the groupby slicing:

In [22]: gb[["A", "B"]].get_group("foo")
Out[22]:
     A         B
0  foo  1.624345
2  foo -0.528172
4  foo  0.865408

In [23]: gb["C"].get_group("foo")
Out[23]:
0     5
2    11
4    14
Name: C, dtype: int64

Answer 1

Wes McKinney (pandas’ author) in Python for Data Analysis provides the following recipe:

groups = dict(list(gb))

which returns a dictionary whose keys are your group labels and whose values are DataFrames, i.e.

groups['foo']

will yield what you are looking for:

     A         B   C
0  foo  1.624345   5
2  foo -0.528172  11
4  foo  0.865408  14

Answer 2

Rather than

gb.get_group('foo')

I prefer using gb.groups

df.loc[gb.groups['foo']]

Because in this way you can choose multiple columns as well. For example:

df.loc[gb.groups['foo'],('A','B')]

Answer 3

gb = df.groupby(['A'])

gb_groups = gb.groups

If you are looking only at selected groups, call gb_groups.keys() to see the available keys, and put the desired keys into the key_list below.

gb_groups.keys()

key_list = [key1, key2, key3 and so on...]

for key, values in gb_groups.items():
    if key in key_list:
        print(df.loc[values], "\n")

Answer 4

I was looking for a way to sample a few members of the GroupBy obj – had to address the posted question to get this done.

create groupby object

grouped = df.groupby('some_key')

pick N group keys at random (the attribute is grouped.indices, and random.sample needs a sequence)

import random
sampled_keys = random.sample(list(grouped.indices), N)

grab the groups

df_list = [grouped.get_group(key) for key in sampled_keys]

optionally – turn it all back into a single dataframe object

sampled_df = pd.concat(df_list, axis=0, join='outer')

Pandas: sum DataFrame rows for given columns

Question: Pandas: sum DataFrame rows for given columns

I have the following DataFrame:

In [1]:

import pandas as pd
df = pd.DataFrame({'a': [1,2,3], 'b': [2,3,4], 'c':['dd','ee','ff'], 'd':[5,9,1]})
df
Out [1]:
   a  b   c  d
0  1  2  dd  5
1  2  3  ee  9
2  3  4  ff  1

I would like to add a column 'e' which is the sum of column 'a', 'b' and 'd'.

Going across forums, I thought something like this would work:

df['e'] = df[['a','b','d']].map(sum)

But it didn’t.

I would like to know the appropriate operation with the list of columns ['a','b','d'] and df as inputs.


Answer 0

You can just call sum and set the param axis=1 to sum the rows; this will ignore non-numeric columns:

In [91]:

df = pd.DataFrame({'a': [1,2,3], 'b': [2,3,4], 'c':['dd','ee','ff'], 'd':[5,9,1]})
df['e'] = df.sum(axis=1)
df
Out[91]:
   a  b   c  d   e
0  1  2  dd  5   8
1  2  3  ee  9  14
2  3  4  ff  1   8

If you want to just sum specific columns then you can create a list of the columns and remove the ones you are not interested in:

In [98]:

col_list= list(df)
col_list.remove('d')
col_list
Out[98]:
['a', 'b', 'c']
In [99]:

df['e'] = df[col_list].sum(axis=1)
df
Out[99]:
   a  b   c  d  e
0  1  2  dd  5  3
1  2  3  ee  9  5
2  3  4  ff  1  7

Answer 1

If you have just a few columns to sum, you can write:

df['e'] = df['a'] + df['b'] + df['d']

This creates new column e with the values:

   a  b   c  d   e
0  1  2  dd  5   8
1  2  3  ee  9  14
2  3  4  ff  1   8

For longer lists of columns, EdChum’s answer is preferred.


Answer 2

Create a list of column names you want to add up.

df['total']=df.loc[:,list_name].sum(axis=1)

If you want the sum for certain rows, specify the rows using ‘:’
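
A small self-contained sketch of this approach; list_name here is just an illustrative variable holding the question's columns 'a', 'b' and 'd':

import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [2, 3, 4], 'c': ['dd', 'ee', 'ff'], 'd': [5, 9, 1]})
list_name = ['a', 'b', 'd']                       # columns to add up
df['total'] = df.loc[:, list_name].sum(axis=1)    # ':' selects all rows
print(df['total'].tolist())   # [8, 14, 8]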


Answer 3

This is a simpler way using iloc to select which columns to sum:

df['f']=df.iloc[:,0:2].sum(axis=1)
df['g']=df.iloc[:,[0,1]].sum(axis=1)
df['h']=df.iloc[:,[0,3]].sum(axis=1)

Produces:

   a  b   c  d   e  f  g   h
0  1  2  dd  5   8  3  3   6
1  2  3  ee  9  14  5  5  11
2  3  4  ff  1   8  7  7   4

I can’t find a way to combine a range and specific columns that works e.g. something like:

df['i']=df.iloc[:,[[0:2],3]].sum(axis=1)
df['i']=df.iloc[:,[0:2,3]].sum(axis=1)

Answer 4

The following syntax helped me when the columns I want to sum are in sequence:

awards_frame.values[:,1:4].sum(axis =1)

Answer 5

You can simply pass your dataframe into the following function:

def sum_frame_by_column(frame, new_col_name, list_of_cols_to_sum):
    frame[new_col_name] = frame[list_of_cols_to_sum].astype(float).sum(axis=1)
    return(frame)

Example:

I have a dataframe (awards_frame) as follows:

[image: the awards_frame DataFrame]

…and I want to create a new column that shows the sum of awards for each row:

Usage:

I simply pass my awards_frame into the function, also specifying the name of the new column, and a list of column names that are to be summed:

sum_frame_by_column(awards_frame, 'award_sum', ['award_1','award_2','award_3'])

Result:

[image: the resulting DataFrame with the new award_sum column]


Answer 6

The shortest and simplest way here is to use

    df.eval('e = a + b + d')
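
Note that by default eval returns a new DataFrame rather than modifying df in place, so assign the result back (a short sketch on the question's frame):

import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [2, 3, 4], 'c': ['dd', 'ee', 'ff'], 'd': [5, 9, 1]})
df = df.eval('e = a + b + d')   # returns a copy with the new column; assign it back
print(df['e'].tolist())         # [8, 14, 8]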

python dataframe pandas drop column using int

Question: python dataframe pandas drop column using int

I understand that to drop a column you use df.drop(‘column name’, axis=1). Is there a way to drop a column using a numerical index instead of the column name?


Answer 0

You can delete column on i index like this:

df.drop(df.columns[i], axis=1)

This can behave strangely if you have duplicate column names, so in that case you can first rename the column you want to delete to a new name. Or you can reassign the DataFrame like this:

df = df.iloc[:, [j for j, c in enumerate(df.columns) if j != i]]

Answer 1

Drop multiple columns like this:

cols = [1,2,4,5,12]
df.drop(df.columns[cols],axis=1,inplace=True)

inplace=True is used to make the changes in the dataframe itself without doing the column dropping on a copy of the data frame. If you need to keep your original intact, use:

df_after_dropping = df.drop(df.columns[cols],axis=1)

Answer 2

If there are multiple columns with identical names, the solutions given here so far will remove all of the columns, which may not be what one is looking for. This may be the case if one is trying to remove duplicate columns except one instance. The example below clarifies this situation:

# make a df with duplicate columns 'x'
df = pd.DataFrame({'x': range(5) , 'x':range(5), 'y':range(6, 11)}, columns = ['x', 'x', 'y']) 


df
Out[495]: 
   x  x   y
0  0  0   6
1  1  1   7
2  2  2   8
3  3  3   9
4  4  4  10

# attempting to drop the first column according to the solution offered so far     
df.drop(df.columns[0], axis = 1) 
   y
0  6
1  7
2  8
3  9
4  10

As you can see, both ‘x’ columns were dropped. Alternative solution:

column_numbers = [x for x in range(df.shape[1])]  # list of columns' integer indices

column_numbers .remove(0) #removing column integer index 0
df.iloc[:, column_numbers] #return all columns except the 0th column

   x  y
0  0  6
1  1  7
2  2  8
3  3  9
4  4  10

As you can see, this truly removed only the 0th column (first ‘x’).


Answer 3

You need to identify the columns based on their position in dataframe. For example, if you want to drop (del) column number 2,3 and 5, it will be,

df.drop(df.columns[[2,3,5]], axis = 1)

Answer 4

If you have two columns with the same name, one simple way is to manually rename the columns like this:

df.columns = ['column1', 'column2', 'column3']

Then you can drop via column index as you requested, like this:

df.drop(df.columns[1], axis=1, inplace=True)

df.columns[1] refers to the column at index 1, which is the one dropped here.

Remember axis 1 = columns and axis 0 = rows.


Answer 5

if you really want to do it with integers (but why?), then you could build a dictionary.

col_dict = {x: col for x, col in enumerate(df.columns)}

then df = df.drop(col_dict[0], 1) will work as desired

edit: you can put it in a function that does that for you, though this way it creates the dictionary every time you call it

def drop_col_n(df, col_n_to_drop):
    col_dict = {x: col for x, col in enumerate(df.columns)}
    return df.drop(col_dict[col_n_to_drop], 1)

df = drop_col_n(df, 2)

Answer 6

You can use the following line to drop the first two columns (or any column you don’t need):

df.drop([df.columns[0], df.columns[1]], axis=1)

Reference


Answer 7

Since there can be multiple columns with the same name, we should first rename the columns. Here is code for the solution.

df.columns=list(range(0,len(df.columns)))
df.drop(columns=[1,2])#drop second and third columns

How to get rid of the “Unnamed: 0” column in a pandas DataFrame?

Question: How to get rid of the “Unnamed: 0” column in a pandas DataFrame?

I have a situation wherein sometimes when I read a csv from df I get an unwanted index-like column named unnamed:0.

file.csv

,A,B,C
0,1,2,3
1,4,5,6
2,7,8,9

The CSV is read with this:

pd.read_csv('file.csv')

   Unnamed: 0  A  B  C
0           0  1  2  3
1           1  4  5  6
2           2  7  8  9

This is very annoying! Does anyone have an idea on how to get rid of this?


Answer 0

It’s the index column, pass index=False to not write it out, see the docs

Example:

In [37]:
df = pd.DataFrame(np.random.randn(5,3), columns=list('abc'))
pd.read_csv(io.StringIO(df.to_csv()))

Out[37]:
   Unnamed: 0         a         b         c
0           0  0.109066 -1.112704 -0.545209
1           1  0.447114  1.525341  0.317252
2           2  0.507495  0.137863  0.886283
3           3  1.452867  1.888363  1.168101
4           4  0.901371 -0.704805  0.088335

compare with:

In [38]:
pd.read_csv(io.StringIO(df.to_csv(index=False)))

Out[38]:
          a         b         c
0  0.109066 -1.112704 -0.545209
1  0.447114  1.525341  0.317252
2  0.507495  0.137863  0.886283
3  1.452867  1.888363  1.168101
4  0.901371 -0.704805  0.088335

You could also optionally tell read_csv that the first column is the index column by passing index_col=0:

In [40]:
pd.read_csv(io.StringIO(df.to_csv()), index_col=0)

Out[40]:
          a         b         c
0  0.109066 -1.112704 -0.545209
1  0.447114  1.525341  0.317252
2  0.507495  0.137863  0.886283
3  1.452867  1.888363  1.168101
4  0.901371 -0.704805  0.088335

Answer 1

This issue most likely manifests because your CSV was saved along with its RangeIndex (which usually doesn’t have a name). The fix would actually need to be done when saving the DataFrame, but this isn’t always an option.

Avoiding the Problem: read_csv with index_col argument

IMO, the simplest solution would be to read the unnamed column as the index. Specify an index_col=[0] argument to pd.read_csv, this reads in the first column as the index.

df = pd.DataFrame('x', index=range(5), columns=list('abc'))
df

   a  b  c
0  x  x  x
1  x  x  x
2  x  x  x
3  x  x  x
4  x  x  x

# Save DataFrame to CSV.
df.to_csv('file.csv')

pd.read_csv('file.csv')

   Unnamed: 0  a  b  c
0           0  x  x  x
1           1  x  x  x
2           2  x  x  x
3           3  x  x  x
4           4  x  x  x

# Now try this again, with the extra argument.
pd.read_csv('file.csv', index_col=[0])

   a  b  c
0  x  x  x
1  x  x  x
2  x  x  x
3  x  x  x
4  x  x  x

Note
You could have avoided this in the first place by using index=False when creating the output CSV, if your DataFrame does not have an index to begin with.

df.to_csv('file.csv', index=False)

But as mentioned above, this isn’t always an option.


Stopgap Solution: Filtering with str.match

If you cannot modify the code to read/write the CSV file, you can just remove the column by filtering with str.match:

df 

   Unnamed: 0  a  b  c
0           0  x  x  x
1           1  x  x  x
2           2  x  x  x
3           3  x  x  x
4           4  x  x  x

df.columns
# Index(['Unnamed: 0', 'a', 'b', 'c'], dtype='object')

df.columns.str.match('Unnamed')
# array([ True, False, False, False])

df.loc[:, ~df.columns.str.match('Unnamed')]

   a  b  c
0  x  x  x
1  x  x  x
2  x  x  x
3  x  x  x
4  x  x  x

Answer 2

Another case where this might happen is if your data was improperly written to your csv so that each row ends with a trailing comma. This will leave you with an unnamed column Unnamed: x at the end of your data when you try to read it into a df.
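
A small sketch of that situation (the inline CSV text is invented for illustration; note every line, including the header, ends with a stray comma):

import io
import pandas as pd

csv_text = "A,B,C,\n1,2,3,\n4,5,6,\n"
df = pd.read_csv(io.StringIO(csv_text))
print(df.columns.tolist())                          # ['A', 'B', 'C', 'Unnamed: 3']

df = df.loc[:, ~df.columns.str.match('Unnamed')]    # drop it, as shown in the answers above
print(df.columns.tolist())                          # ['A', 'B', 'C']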


Answer 3

To get rid of all Unnamed columns, you can also use a regex such as df.drop(df.filter(regex="Unname"), axis=1, inplace=True)


Answer 4

Simply delete that column using: del df['column_name']


Logical operators for boolean indexing in pandas

Question: Logical operators for boolean indexing in pandas

I’m working with boolean index in Pandas. The question is why the statement:

a[(a['some_column']==some_number) & (a['some_other_column']==some_other_number)]

works fine whereas

a[(a['some_column']==some_number) and (a['some_other_column']==some_other_number)]

exits with error?

Example:

a=pd.DataFrame({'x':[1,1],'y':[10,20]})

In: a[(a['x']==1)&(a['y']==10)]
Out:    x   y
     0  1  10

In: a[(a['x']==1) and (a['y']==10)]
Out: ValueError: The truth value of an array with more than one element is ambiguous.     Use a.any() or a.all()

Answer 0

When you say

(a['x']==1) and (a['y']==10)

You are implicitly asking Python to convert (a['x']==1) and (a['y']==10) to boolean values.

NumPy arrays (of length greater than 1) and Pandas objects such as Series do not have a boolean value — in other words, they raise

ValueError: The truth value of an array is ambiguous. Use a.empty, a.any() or a.all().

when used as a boolean value. That’s because it’s unclear when it should be True or False. Some users might assume they are True if they have non-zero length, like a Python list. Others might desire for it to be True only if all its elements are True. Others might want it to be True if any of its elements are True.

Because there are so many conflicting expectations, the designers of NumPy and Pandas refuse to guess, and instead raise a ValueError.

Instead, you must be explicit, by calling the empty(), all() or any() method to indicate which behavior you desire.

In this case, however, it looks like you do not want boolean evaluation, you want element-wise logical-and. That is what the & binary operator performs:

(a['x']==1) & (a['y']==10)

returns a boolean array.


By the way, as alexpmil notes, the parentheses are mandatory since & has a higher operator precedence than ==. Without the parentheses, a['x']==1 & a['y']==10 would be evaluated as a['x'] == (1 & a['y']) == 10 which would in turn be equivalent to the chained comparison (a['x'] == (1 & a['y'])) and ((1 & a['y']) == 10). That is an expression of the form Series and Series. The use of and with two Series would again trigger the same ValueError as above. That’s why the parentheses are mandatory.
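
A quick self-contained sketch of both forms on the question's frame, showing the element-wise & working while the plain and fails:

import pandas as pd

a = pd.DataFrame({'x': [1, 1], 'y': [10, 20]})

print(a[(a['x'] == 1) & (a['y'] == 10)])    # element-wise AND: returns the matching row

try:
    a[(a['x'] == 1) and (a['y'] == 10)]     # 'and' forces bool() on a whole Series
except ValueError as exc:
    print(exc)                              # "The truth value of a Series is ambiguous. ..."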


Answer 1

TLDR; Logical Operators in Pandas are &, | and ~, and parentheses (...) is important!

Python’s and, or and not logical operators are designed to work with scalars. So Pandas had to do one better and override the bitwise operators to achieve vectorized (element-wise) version of this functionality.

So the following in python (exp1 and exp2 are expressions which evaluate to a boolean result)…

exp1 and exp2              # Logical AND
exp1 or exp2               # Logical OR
not exp1                   # Logical NOT

…will translate to…

exp1 & exp2                # Element-wise logical AND
exp1 | exp2                # Element-wise logical OR
~exp1                      # Element-wise logical NOT

for pandas.

If in the process of performing logical operation you get a ValueError, then you need to use parentheses for grouping:

(exp1) op (exp2)

For example,

(df['col1'] == x) & (df['col2'] == y) 

And so on.


Boolean Indexing: A common operation is to compute boolean masks through logical conditions to filter the data. Pandas provides three operators: & for logical AND, | for logical OR, and ~ for logical NOT.

Consider the following setup:

np.random.seed(0)
df = pd.DataFrame(np.random.choice(10, (5, 3)), columns=list('ABC'))
df

   A  B  C
0  5  0  3
1  3  7  9
2  3  5  2
3  4  7  6
4  8  8  1

Logical AND

For df above, say you’d like to return all rows where A < 5 and B > 5. This is done by computing masks for each condition separately, and ANDing them.

Overloaded Bitwise & Operator
Before continuing, please take note of this particular excerpt of the docs, which state

Another common operation is the use of boolean vectors to filter the data. The operators are: | for or, & for and, and ~ for not. These must be grouped by using parentheses, since by default Python will evaluate an expression such as df.A > 2 & df.B < 3 as df.A > (2 & df.B) < 3, while the desired evaluation order is (df.A > 2) & (df.B < 3).

So, with this in mind, element wise logical AND can be implemented with the bitwise operator &:

df['A'] < 5

0    False
1     True
2     True
3     True
4    False
Name: A, dtype: bool

df['B'] > 5

0    False
1     True
2    False
3     True
4     True
Name: B, dtype: bool

(df['A'] < 5) & (df['B'] > 5)

0    False
1     True
2    False
3     True
4    False
dtype: bool

And the subsequent filtering step is simply,

df[(df['A'] < 5) & (df['B'] > 5)]

   A  B  C
1  3  7  9
3  4  7  6

The parentheses are used to override the default precedence order of bitwise operators, which have higher precedence over the conditional operators < and >. See the section of Operator Precedence in the python docs.

If you do not use parentheses, the expression is evaluated incorrectly. For example, if you accidentally attempt something such as

df['A'] < 5 & df['B'] > 5

It is parsed as

df['A'] < (5 & df['B']) > 5

Which becomes,

df['A'] < something_you_dont_want > 5

Which becomes (see the python docs on chained operator comparison),

(df['A'] < something_you_dont_want) and (something_you_dont_want > 5)

Which becomes,

# Both operands are Series...
something_else_you_dont_want1 and something_else_you_dont_want2

Which throws

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

So, don’t make that mistake!1

Avoiding Parentheses Grouping
The fix is actually quite simple. Most operators have a corresponding bound method for DataFrames. If the individual masks are built up using functions instead of conditional operators, you will no longer need to group by parens to specify evaluation order:

df['A'].lt(5)

0     True
1     True
2     True
3     True
4    False
Name: A, dtype: bool

df['B'].gt(5)

0    False
1     True
2    False
3     True
4     True
Name: B, dtype: bool

df['A'].lt(5) & df['B'].gt(5)

0    False
1     True
2    False
3     True
4    False
dtype: bool

See the section on Flexible Comparisons. To summarise, we have

╒════╤════════════╤════════════╕
│    │ Operator   │ Function   │
╞════╪════════════╪════════════╡
│  0 │ >          │ gt         │
├────┼────────────┼────────────┤
│  1 │ >=         │ ge         │
├────┼────────────┼────────────┤
│  2 │ <          │ lt         │
├────┼────────────┼────────────┤
│  3 │ <=         │ le         │
├────┼────────────┼────────────┤
│  4 │ ==         │ eq         │
├────┼────────────┼────────────┤
│  5 │ !=         │ ne         │
╘════╧════════════╧════════════╛

Another option for avoiding parentheses is to use DataFrame.query (or eval):

df.query('A < 5 and B > 5')

   A  B  C
1  3  7  9
3  4  7  6

I have extensively documented query and eval in Dynamic Expression Evaluation in pandas using pd.eval().

operator.and_
Allows you to perform this operation in a functional manner. Internally calls Series.__and__ which corresponds to the bitwise operator.

import operator 

operator.and_(df['A'] < 5, df['B'] > 5)
# Same as,
# (df['A'] < 5).__and__(df['B'] > 5) 

0    False
1     True
2    False
3     True
4    False
dtype: bool

df[operator.and_(df['A'] < 5, df['B'] > 5)]

   A  B  C
1  3  7  9
3  4  7  6

You won’t usually need this, but it is useful to know.

Generalizing: np.logical_and (and logical_and.reduce)
Another alternative is using np.logical_and, which also does not need parentheses grouping:

np.logical_and(df['A'] < 5, df['B'] > 5)

0    False
1     True
2    False
3     True
4    False
Name: A, dtype: bool

df[np.logical_and(df['A'] < 5, df['B'] > 5)]

   A  B  C
1  3  7  9
3  4  7  6

np.logical_and is a ufunc (Universal Functions), and most ufuncs have a reduce method. This means it is easier to generalise with logical_and if you have multiple masks to AND. For example, to AND masks m1 and m2 and m3 with &, you would have to do

m1 & m2 & m3

However, an easier option is

np.logical_and.reduce([m1, m2, m3])

This is powerful, because it lets you build on top of this with more complex logic (for example, dynamically generating masks in a list comprehension and adding all of them):

import operator

cols = ['A', 'B']
ops = [np.less, np.greater]
values = [5, 5]

m = np.logical_and.reduce([op(df[c], v) for op, c, v in zip(ops, cols, values)])
m 
# array([False,  True, False,  True, False])

df[m]
   A  B  C
1  3  7  9
3  4  7  6

1 – I know I’m harping on this point, but please bear with me. This is a very, very common beginner’s mistake, and must be explained very thoroughly.


Logical OR

For the df above, say you’d like to return all rows where A == 3 or B == 7.

Overloaded Bitwise |

df['A'] == 3

0    False
1     True
2     True
3    False
4    False
Name: A, dtype: bool

df['B'] == 7

0    False
1     True
2    False
3     True
4    False
Name: B, dtype: bool

(df['A'] == 3) | (df['B'] == 7)

0    False
1     True
2     True
3     True
4    False
dtype: bool

df[(df['A'] == 3) | (df['B'] == 7)]

   A  B  C
1  3  7  9
2  3  5  2
3  4  7  6

If you haven’t yet, please also read the section on Logical AND above, all caveats apply here.

Alternatively, this operation can be specified with

df[df['A'].eq(3) | df['B'].eq(7)]

   A  B  C
1  3  7  9
2  3  5  2
3  4  7  6

operator.or_
Calls Series.__or__ under the hood.

operator.or_(df['A'] == 3, df['B'] == 7)
# Same as,
# (df['A'] == 3).__or__(df['B'] == 7)

0    False
1     True
2     True
3     True
4    False
dtype: bool

df[operator.or_(df['A'] == 3, df['B'] == 7)]

   A  B  C
1  3  7  9
2  3  5  2
3  4  7  6

np.logical_or
For two conditions, use logical_or:

np.logical_or(df['A'] == 3, df['B'] == 7)

0    False
1     True
2     True
3     True
4    False
Name: A, dtype: bool

df[np.logical_or(df['A'] == 3, df['B'] == 7)]

   A  B  C
1  3  7  9
2  3  5  2
3  4  7  6

For multiple masks, use logical_or.reduce:

np.logical_or.reduce([df['A'] == 3, df['B'] == 7])
# array([False,  True,  True,  True, False])

df[np.logical_or.reduce([df['A'] == 3, df['B'] == 7])]

   A  B  C
1  3  7  9
2  3  5  2
3  4  7  6

Logical NOT

Given a mask, such as

mask = pd.Series([True, True, False])

If you need to invert every boolean value (so that the end result is [False, False, True]), then you can use any of the methods below.

Bitwise ~

~mask

0    False
1    False
2     True
dtype: bool

Again, expressions need to be parenthesised.

~(df['A'] == 3)

0     True
1    False
2    False
3     True
4     True
Name: A, dtype: bool

This internally calls

mask.__invert__()

0    False
1    False
2     True
dtype: bool

But don’t use it directly.

operator.inv
Internally calls __invert__ on the Series.

operator.inv(mask)

0    False
1    False
2     True
dtype: bool

np.logical_not
This is the numpy variant.

np.logical_not(mask)

0    False
1    False
2     True
dtype: bool

Note, np.logical_and can be substituted for np.bitwise_and, logical_or with bitwise_or, and logical_not with invert.
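
A minimal sketch (with a made-up pair of boolean Series) confirming that the bitwise and logical ufuncs agree as long as the operands are boolean:

import numpy as np
import pandas as pd

m1 = pd.Series([True, True, False, False])
m2 = pd.Series([True, False, True, False])

# On boolean data the two families of ufuncs give identical results.
print((np.logical_and(m1, m2) == np.bitwise_and(m1, m2)).all())  # True
print((np.logical_or(m1, m2) == np.bitwise_or(m1, m2)).all())    # True
print((np.logical_not(m1) == np.invert(m1)).all())               # True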


回答 2

熊猫中布尔索引的逻辑运算符

重要的是要认识到,你不能在 pandas.Series 或 pandas.DataFrame 上使用任何 Python 逻辑运算符(and、or 或 not)(同样,你也不能在含有多个元素的 numpy.array 上使用它们)。之所以不能使用,是因为它们会隐式地对操作数调用 bool,而这些数据结构认为数组的布尔值是不明确的,因此会引发异常:

>>> import numpy as np
>>> import pandas as pd
>>> arr = np.array([1,2,3])
>>> s = pd.Series([1,2,3])
>>> df = pd.DataFrame([1,2,3])
>>> bool(arr)
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
>>> bool(s)
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
>>> bool(df)
ValueError: The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

我在对“Truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all()”这个问答的回答中更详细地讨论了这一点。

NumPy 的逻辑函数

然而,NumPy 提供了与这些运算符等价的逐元素操作函数,可用于 numpy.array、pandas.Series、pandas.DataFrame 或任何其他(符合规范的)numpy.array 子类:

因此,从本质上讲,应该使用(假设df1并且df2是pandas DataFrames):

np.logical_and(df1, df2)
np.logical_or(df1, df2)
np.logical_not(df1)
np.logical_xor(df1, df2)

布尔的按位函数和按位运算符

但是,如果您具有布尔NumPy数组,pandas系列或pandas DataFrame,则也可以使用按元素逐位的函数(对于布尔,它们与逻辑函数是(或至少应该是)不可区分的):

通常使用运算符。但是,当与比较运算符组合使用时,必须记住将比较括在括号中,因为按位运算符的优先级高于比较运算符

(df1 < 10) | (df2 > 10)  # instead of the wrong df1 < 10 | df2 > 10

这可能很烦人,因为Python逻辑运算符的优先级比比较运算符的优先级低,因此您通常可以编写a < 10 and b > 10(其中a并且b是简单整数),并且不需要括号。

逻辑和按位运算之间的差异(非布尔值)

需要特别强调的是,位和逻辑运算仅对布尔NumPy数组(以及布尔Series和DataFrame)是等效的。如果这些不包含布尔值,则这些操作将给出不同的结果。我将包括使用NumPy数组的示例,但对于pandas数据结构,结果将相似:

>>> import numpy as np
>>> a1 = np.array([0, 0, 1, 1])
>>> a2 = np.array([0, 1, 0, 1])

>>> np.logical_and(a1, a2)
array([False, False, False,  True])
>>> np.bitwise_and(a1, a2)
array([0, 0, 0, 1], dtype=int32)

而且由于NumPy(和类似的pandas)对boolean(布尔或“掩码”索引数组)和integer(Index数组)索引所做的操作不同,因此索引的结果也将不同:

>>> a3 = np.array([1, 2, 3, 4])

>>> a3[np.logical_and(a1, a2)]
array([4])
>>> a3[np.bitwise_and(a1, a2)]
array([1, 1, 1, 2])

汇总表

Logical operator | NumPy logical function | NumPy bitwise function | Bitwise operator
-------------------------------------------------------------------------------------
       and       |  np.logical_and        | np.bitwise_and         |        &
-------------------------------------------------------------------------------------
       or        |  np.logical_or         | np.bitwise_or          |        |
-------------------------------------------------------------------------------------
                 |  np.logical_xor        | np.bitwise_xor         |        ^
-------------------------------------------------------------------------------------
       not       |  np.logical_not        | np.invert              |        ~

表中的逻辑运算符不适用于 NumPy 数组、pandas Series 和 pandas DataFrame;其余几列则适用于这些数据结构(以及普通的 Python 对象),并且都是逐元素操作。但是,对普通的 Python bool 使用按位取反 ~ 时要小心,因为在这种情况下布尔值会被解释为整数(例如 ~False 返回 -1,~True 返回 -2)。

Logical operators for boolean indexing in Pandas

It’s important to realize that you cannot use any of the Python logical operators (and, or or not) on pandas.Series or pandas.DataFrames (similarly you cannot use them on numpy.arrays with more than one element). The reason why you cannot use those is because they implicitly call bool on their operands which throws an Exception because these data structures decided that the boolean of an array is ambiguous:

>>> import numpy as np
>>> import pandas as pd
>>> arr = np.array([1,2,3])
>>> s = pd.Series([1,2,3])
>>> df = pd.DataFrame([1,2,3])
>>> bool(arr)
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
>>> bool(s)
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
>>> bool(df)
ValueError: The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

I did cover this more extensively in my answer to the “Truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all()” Q+A.

NumPy's logical functions

However NumPy provides element-wise operating equivalents to these operators as functions that can be used on numpy.array, pandas.Series, pandas.DataFrame, or any other (conforming) numpy.array subclass:

So, essentially, one should use (assuming df1 and df2 are pandas DataFrames):

np.logical_and(df1, df2)
np.logical_or(df1, df2)
np.logical_not(df1)
np.logical_xor(df1, df2)

Bitwise functions and bitwise operators for booleans

However in case you have boolean NumPy array, pandas Series, or pandas DataFrames you could also use the element-wise bitwise functions (for booleans they are – or at least should be – indistinguishable from the logical functions):

Typically the operators are used. However when combined with comparison operators one has to remember to wrap the comparison in parenthesis because the bitwise operators have a higher precedence than the comparison operators:

(df1 < 10) | (df2 > 10)  # instead of the wrong df1 < 10 | df2 > 10

This may be irritating because the Python logical operators have a lower precedence than the comparison operators, so you normally write a < 10 and b > 10 (where a and b are, for example, simple integers) and don't need the parentheses.
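
To make the precedence pitfall concrete, here is a minimal sketch (df1 and df2 are stand-in Series invented for the example) showing the parenthesised form next to the error you get without parentheses:

import pandas as pd

df1 = pd.Series([5, 15, 25])
df2 = pd.Series([1, 20, 3])

# Correct: each comparison is parenthesised before combining with |.
print((df1 < 10) | (df2 > 10))

# Without parentheses, | binds tighter than < and >, so Python parses this as
# the chained comparison df1 < (10 | df2) > 10, whose implicit `and` on a
# Series raises the familiar "truth value ... is ambiguous" error.
try:
    df1 < 10 | df2 > 10
except ValueError as exc:
    print(exc)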

Differences between logical and bitwise operations (on non-booleans)

It is really important to stress that bit and logical operations are only equivalent for boolean NumPy arrays (and boolean Series & DataFrames). If these don’t contain booleans then the operations will give different results. I’ll include examples using NumPy arrays but the results will be similar for the pandas data structures:

>>> import numpy as np
>>> a1 = np.array([0, 0, 1, 1])
>>> a2 = np.array([0, 1, 0, 1])

>>> np.logical_and(a1, a2)
array([False, False, False,  True])
>>> np.bitwise_and(a1, a2)
array([0, 0, 0, 1], dtype=int32)

And since NumPy (and similarly pandas) does different things for boolean (Boolean or “mask” index arrays) and integer (index arrays) indices, the results of indexing will also be different:

>>> a3 = np.array([1, 2, 3, 4])

>>> a3[np.logical_and(a1, a2)]
array([4])
>>> a3[np.bitwise_and(a1, a2)]
array([1, 1, 1, 2])

Summary table

Logical operator | NumPy logical function | NumPy bitwise function | Bitwise operator
-------------------------------------------------------------------------------------
       and       |  np.logical_and        | np.bitwise_and         |        &
-------------------------------------------------------------------------------------
       or        |  np.logical_or         | np.bitwise_or          |        |
-------------------------------------------------------------------------------------
                 |  np.logical_xor        | np.bitwise_xor         |        ^
-------------------------------------------------------------------------------------
       not       |  np.logical_not        | np.invert              |        ~

Where the logical operator does not work for NumPy arrays, pandas Series, and pandas DataFrames. The others work on these data structures (and plain Python objects) and work element-wise. However be careful with the bitwise invert on plain Python bools because the bool will be interpreted as integers in this context (for example ~False returns -1 and ~True returns -2).
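
A small illustration of that last caveat, using throwaway values: ~ on a plain Python bool does integer inversion, whereas np.logical_not and ~ on a boolean Series negate element-wise:

import numpy as np
import pandas as pd

print(~True, ~False)              # -2 -1  (plain bools are treated as integers)
print(np.logical_not(True))       # False
print(~pd.Series([True, False]))  # element-wise logical NOT: False, True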


熊猫可以自动识别日期吗?

问题:熊猫可以自动识别日期吗?

今天,我感到惊讶的是,pandas在从数据文件中读取数据时能够识别值的类型:

df = pandas.read_csv('test.dat', delimiter=r"\s+", names=['col1','col2','col3'])

例如,可以通过以下方式检查它:

for i, r in df.iterrows():
    print type(r['col1']), type(r['col2']), type(r['col3'])

特别是整数、浮点数和字符串都能被正确识别。但是,我有一列日期采用以下格式:2013-6-4。这些日期被识别为字符串(而不是 python 日期对象)。有没有办法让 pandas “学会”识别日期?

Today I was positively surprised by the fact that while reading data from a data file (for example) pandas is able to recognize types of values:

df = pandas.read_csv('test.dat', delimiter=r"\s+", names=['col1','col2','col3'])

For example it can be checked in this way:

for i, r in df.iterrows():
    print type(r['col1']), type(r['col2']), type(r['col3'])

In particular integers, floats and strings were recognized correctly. However, I have a column that has dates in the following format: 2013-6-4. These dates were recognized as strings (not as python date-objects). Is there a way to “learn” pandas to recognize dates?


回答 0

您应该在读取时添加 parse_dates=True 或 parse_dates=['column name'],这通常足以神奇地完成解析。但总有一些奇怪的格式需要手动定义。在这种情况下,您还可以添加一个日期解析函数,这是最灵活的方法。

假设您有一列名为 'datetime' 的字符串列,那么:

from datetime import datetime
dateparse = lambda x: datetime.strptime(x, '%Y-%m-%d %H:%M:%S')

df = pd.read_csv(infile, parse_dates=['datetime'], date_parser=dateparse)

这样,您甚至可以将多个列合并为一个datetime列,从而将一个“ date”和一个“ time”列合并为一个“ datetime”列:

dateparse = lambda x: datetime.strptime(x, '%Y-%m-%d %H:%M:%S')

df = pd.read_csv(infile, parse_dates={'datetime': ['date', 'time']}, date_parser=dateparse)

您可以在此页面找到 strptime 和 strftime 的指令(即用于不同格式的字母)。

You should add parse_dates=True, or parse_dates=['column name'] when reading, thats usually enough to magically parse it. But there are always weird formats which need to be defined manually. In such a case you can also add a date parser function, which is the most flexible way possible.

Suppose you have a column ‘datetime’ with your string, then:

from datetime import datetime
dateparse = lambda x: datetime.strptime(x, '%Y-%m-%d %H:%M:%S')

df = pd.read_csv(infile, parse_dates=['datetime'], date_parser=dateparse)

This way you can even combine multiple columns into a single datetime column, this merges a ‘date’ and a ‘time’ column into a single ‘datetime’ column:

dateparse = lambda x: datetime.strptime(x, '%Y-%m-%d %H:%M:%S')

df = pd.read_csv(infile, parse_dates={'datetime': ['date', 'time']}, date_parser=dateparse)

You can find directives (i.e. the letters to be used for different formats) for strptime and strftime in this page.
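
Note that pd.datetime, used in some of the older snippets on this page, has since been removed from pandas, and newer releases deprecate the date_parser argument of read_csv in favour of date_format. As a minimal post-read alternative (the file and column names from the question are reused as placeholders, with col3 assumed to be the date column):

import pandas as pd

df = pd.read_csv('test.dat', delimiter=r"\s+", names=['col1', 'col2', 'col3'])

# Convert the string column after reading; pass format='%Y-%m-%d' if you want
# to be explicit instead of letting pandas infer the layout.
df['col3'] = pd.to_datetime(df['col3'])
print(df.dtypes)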


回答 1

自@Rutger回答以来,熊猫界面可能已更改,但是在我使用的版本(0.15.2)中,该date_parser函数接收日期列表,而不是单个值。在这种情况下,他的代码应该这样更新:

dateparse = lambda dates: [pd.datetime.strptime(d, '%Y-%m-%d %H:%M:%S') for d in dates]

df = pd.read_csv(infile, parse_dates=['datetime'], date_parser=dateparse)

Perhaps the pandas interface has changed since @Rutger answered, but in the version I’m using (0.15.2), the date_parser function receives a list of dates instead of a single value. In this case, his code should be updated like so:

dateparse = lambda dates: [pd.datetime.strptime(d, '%Y-%m-%d %H:%M:%S') for d in dates]

df = pd.read_csv(infile, parse_dates=['datetime'], date_parser=dateparse)

回答 2

pandas 的 read_csv 方法非常适合解析日期。完整的文档位于 http://pandas.pydata.org/pandas-docs/stable/generated/pandas.io.parsers.read_csv.html

您甚至可以在不同的列中包含不同的日期部分,并传递参数:

parse_dates : boolean, list of ints or names, list of lists, or dict
If True -> try parsing the index. If [1, 2, 3] -> try parsing columns 1, 2, 3 each as a
separate date column. If [[1, 3]] -> combine columns 1 and 3 and parse as a single date
column. {'foo': [1, 3]} -> parse columns 1, 3 as date and call result 'foo'

默认的日期检测效果很好,但似乎偏向北美日期格式。如果您住在其他地方,可能偶尔会被结果坑到。据我所知,1/6/2000 在美国表示 1 月 6 日,而在我居住的地方表示 6 月 1 日。如果使用 23/6/2000 这样的日期,它也足够聪明,会自动调换顺序。不过,坚持使用 YYYYMMDD 这类日期格式可能更安全。在此向 pandas 开发者致歉,我最近没有用本地日期格式测试过。

您可以使用date_parser参数传递一个函数来转换格式。

date_parser : function
Function to use for converting a sequence of string columns to an array of datetime
instances. The default uses dateutil.parser.parser to do the conversion.

pandas read_csv method is great for parsing dates. Complete documentation at http://pandas.pydata.org/pandas-docs/stable/generated/pandas.io.parsers.read_csv.html

you can even have the different date parts in different columns and pass the parameter:

parse_dates : boolean, list of ints or names, list of lists, or dict
If True -> try parsing the index. If [1, 2, 3] -> try parsing columns 1, 2, 3 each as a
separate date column. If [[1, 3]] -> combine columns 1 and 3 and parse as a single date
column. {'foo': [1, 3]} -> parse columns 1, 3 as date and call result 'foo'

The default sensing of dates works great, but it seems to be biased towards North American date formats. If you live elsewhere you might occasionally be caught out by the results. As far as I can remember, 1/6/2000 means 6 January in the USA as opposed to 1 June where I live. It is smart enough to swing them around if dates like 23/6/2000 are used. It is probably safer to stay with YYYYMMDD variations of date, though. Apologies to the pandas developers here, but I have not tested it with local dates recently.

you can use the date_parser parameter to pass a function to convert your format.

date_parser : function
Function to use for converting a sequence of string columns to an array of datetime
instances. The default uses dateutil.parser.parser to do the conversion.
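
For the day-first ambiguity mentioned above, read_csv also accepts a dayfirst flag; a minimal sketch with an inline CSV made up for illustration:

import io
import pandas as pd

data = io.StringIO("date,value\n01/06/2000,1\n23/06/2000,2\n")

# dayfirst=True tells the parser that 01/06/2000 means 1 June, not 6 January.
df = pd.read_csv(data, parse_dates=['date'], dayfirst=True)
print(df['date'])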

回答 3

您可以按照 pandas.read_csv() 文档中的建议使用 pandas.to_datetime():

如果列或索引包含无法解析的日期,则整个列或索引将按原样以 object 数据类型返回。对于非标准的日期时间解析,请在 pd.read_csv 之后使用 pd.to_datetime。

演示:

>>> D = {'date': '2013-6-4'}
>>> df = pd.DataFrame(D, index=[0])
>>> df
       date
0  2013-6-4
>>> df.dtypes
date    object
dtype: object
>>> df['date'] = pd.to_datetime(df.date, format='%Y-%m-%d')
>>> df
        date
0 2013-06-04
>>> df.dtypes
date    datetime64[ns]
dtype: object

You could use pandas.to_datetime() as recommended in the documentation for pandas.read_csv():

If a column or index contains an unparseable date, the entire column or index will be returned unaltered as an object data type. For non-standard datetime parsing, use pd.to_datetime after pd.read_csv.

Demo:

>>> D = {'date': '2013-6-4'}
>>> df = pd.DataFrame(D, index=[0])
>>> df
       date
0  2013-6-4
>>> df.dtypes
date    object
dtype: object
>>> df['date'] = pd.to_datetime(df.date, format='%Y-%m-%d')
>>> df
        date
0 2013-06-04
>>> df.dtypes
date    datetime64[ns]
dtype: object

回答 4

将两列合并为一个datetime列时,可接受的答案会产生错误(pandas版本0.20.3),因为这些列分别发送到date_parser函数。

以下作品:

def dateparse(d,t):
    dt = d + " " + t
    return pd.datetime.strptime(dt, '%d/%m/%Y %H:%M:%S')

df = pd.read_csv(infile, parse_dates={'datetime': ['date', 'time']}, date_parser=dateparse)

When merging two columns into a single datetime column, the accepted answer generates an error (pandas version 0.20.3), since the columns are sent to the date_parser function separately.

The following works:

def dateparse(d,t):
    dt = d + " " + t
    return pd.datetime.strptime(dt, '%d/%m/%Y %H:%M:%S')

df = pd.read_csv(infile, parse_dates={'datetime': ['date', 'time']}, date_parser=dateparse)

回答 5

是的-根据pandas.read_csv 文档

注意:存在iso8601格式日期的快速路径。

因此,如果您的 csv 有一列名为 datetime,并且日期看起来像 2013-01-01T01:01 这样,那么运行以下命令将使 pandas(我用的是 v0.19.2)自动解析出日期和时间:

df = pd.read_csv('test.csv', parse_dates=['datetime'])

请注意,您需要显式传递parse_dates,否则将无法正常运行。

验证:

df.dtypes

您应该看到列的数据类型是 datetime64[ns]

Yes – according to the pandas.read_csv documentation:

Note: A fast-path exists for iso8601-formatted dates.

So if your csv has a column named datetime and the dates looks like 2013-01-01T01:01 for example, running this will make pandas (I’m on v0.19.2) pick up the date and time automatically:

df = pd.read_csv('test.csv', parse_dates=['datetime'])

Note that you need to explicitly pass parse_dates, it doesn’t work without.

Verify with:

df.dtypes

You should see the datatype of the column is datetime64[ns]


回答 6

如果性能对您很重要,请务必先做计时测试:

import sys
import timeit
import pandas as pd

print('Python %s on %s' % (sys.version, sys.platform))
print('Pandas version %s' % pd.__version__)

repeat = 3
numbers = 100

def time(statement, _setup=None):
    print (min(
        timeit.Timer(statement, setup=_setup or setup).repeat(
            repeat, numbers)))

print("Format %m/%d/%y")
setup = """import pandas as pd
import io

data = io.StringIO('''\
ProductCode,Date
''' + '''\
x1,07/29/15
x2,07/29/15
x3,07/29/15
x4,07/30/15
x5,07/29/15
x6,07/29/15
x7,07/29/15
y7,08/05/15
x8,08/05/15
z3,08/05/15
''' * 100)"""

time('pd.read_csv(data); data.seek(0)')
time('pd.read_csv(data, parse_dates=["Date"]); data.seek(0)')
time('pd.read_csv(data, parse_dates=["Date"],'
     'infer_datetime_format=True); data.seek(0)')
time('pd.read_csv(data, parse_dates=["Date"],'
     'date_parser=lambda x: pd.datetime.strptime(x, "%m/%d/%y")); data.seek(0)')

print("Format %Y-%m-%d %H:%M:%S")
setup = """import pandas as pd
import io

data = io.StringIO('''\
ProductCode,Date
''' + '''\
x1,2016-10-15 00:00:43
x2,2016-10-15 00:00:56
x3,2016-10-15 00:00:56
x4,2016-10-15 00:00:12
x5,2016-10-15 00:00:34
x6,2016-10-15 00:00:55
x7,2016-10-15 00:00:06
y7,2016-10-15 00:00:01
x8,2016-10-15 00:00:00
z3,2016-10-15 00:00:02
''' * 1000)"""

time('pd.read_csv(data); data.seek(0)')
time('pd.read_csv(data, parse_dates=["Date"]); data.seek(0)')
time('pd.read_csv(data, parse_dates=["Date"],'
     'infer_datetime_format=True); data.seek(0)')
time('pd.read_csv(data, parse_dates=["Date"],'
     'date_parser=lambda x: pd.datetime.strptime(x, "%Y-%m-%d %H:%M:%S")); data.seek(0)')

输出:

Python 3.7.1 (v3.7.1:260ec2c36a, Oct 20 2018, 03:13:28) 
[Clang 6.0 (clang-600.0.57)] on darwin
Pandas version 0.23.4
Format %m/%d/%y
0.19123052499999993
8.20691274
8.143124389
1.2384357139999977
Format %Y-%m-%d %H:%M:%S
0.5238807110000039
0.9202787830000005
0.9832778819999959
12.002349824999996

因此,对于 iso8601 格式的日期(%Y-%m-%d %H:%M:%S 显然就是一种 iso8601 格式的日期,我猜其中的 T 可以去掉并用空格代替),您不应该指定 infer_datetime_format(对更常见的格式它显然也没有什么差别),而传入自己的解析器只会严重拖慢性能。另一方面,对于不那么标准的日期格式,date_parser 确实能带来差别。像往常一样,请务必先计时再做优化。

If performance matters to you make sure you time:

import sys
import timeit
import pandas as pd

print('Python %s on %s' % (sys.version, sys.platform))
print('Pandas version %s' % pd.__version__)

repeat = 3
numbers = 100

def time(statement, _setup=None):
    print (min(
        timeit.Timer(statement, setup=_setup or setup).repeat(
            repeat, numbers)))

print("Format %m/%d/%y")
setup = """import pandas as pd
import io

data = io.StringIO('''\
ProductCode,Date
''' + '''\
x1,07/29/15
x2,07/29/15
x3,07/29/15
x4,07/30/15
x5,07/29/15
x6,07/29/15
x7,07/29/15
y7,08/05/15
x8,08/05/15
z3,08/05/15
''' * 100)"""

time('pd.read_csv(data); data.seek(0)')
time('pd.read_csv(data, parse_dates=["Date"]); data.seek(0)')
time('pd.read_csv(data, parse_dates=["Date"],'
     'infer_datetime_format=True); data.seek(0)')
time('pd.read_csv(data, parse_dates=["Date"],'
     'date_parser=lambda x: pd.datetime.strptime(x, "%m/%d/%y")); data.seek(0)')

print("Format %Y-%m-%d %H:%M:%S")
setup = """import pandas as pd
import io

data = io.StringIO('''\
ProductCode,Date
''' + '''\
x1,2016-10-15 00:00:43
x2,2016-10-15 00:00:56
x3,2016-10-15 00:00:56
x4,2016-10-15 00:00:12
x5,2016-10-15 00:00:34
x6,2016-10-15 00:00:55
x7,2016-10-15 00:00:06
y7,2016-10-15 00:00:01
x8,2016-10-15 00:00:00
z3,2016-10-15 00:00:02
''' * 1000)"""

time('pd.read_csv(data); data.seek(0)')
time('pd.read_csv(data, parse_dates=["Date"]); data.seek(0)')
time('pd.read_csv(data, parse_dates=["Date"],'
     'infer_datetime_format=True); data.seek(0)')
time('pd.read_csv(data, parse_dates=["Date"],'
     'date_parser=lambda x: pd.datetime.strptime(x, "%Y-%m-%d %H:%M:%S")); data.seek(0)')

prints:

Python 3.7.1 (v3.7.1:260ec2c36a, Oct 20 2018, 03:13:28) 
[Clang 6.0 (clang-600.0.57)] on darwin
Pandas version 0.23.4
Format %m/%d/%y
0.19123052499999993
8.20691274
8.143124389
1.2384357139999977
Format %Y-%m-%d %H:%M:%S
0.5238807110000039
0.9202787830000005
0.9832778819999959
12.002349824999996

So with iso8601-formatted dates (%Y-%m-%d %H:%M:%S is apparently an iso8601-formatted date; I guess the T can be dropped and replaced by a space) you should not specify infer_datetime_format (which apparently does not make a difference with more common formats either), and passing your own parser in just cripples performance. On the other hand, date_parser does make a difference with not-so-standard date formats. Be sure to time before you optimize, as usual.


回答 7

加载包含日期列的 csv 文件时,我们有两种方法可以让 pandas 识别日期列,即

  1. 熊猫通过参数 date_parser=mydateparser 显式识别格式

  2. 熊猫通过参数 infer_datetime_format=True 隐式识别格式

一些日期列数据

01/01/18

01/02/18

这里我们不知道前两位到底是月份还是日期。因此,在这种情况下,我们必须使用方法1:显式传递格式

    mydateparser = lambda x: pd.datetime.strptime(x, "%m/%d/%y")
    df = pd.read_csv(file_name, parse_dates=['date_col_name'],
                     date_parser=mydateparser)

方法2:隐式(自动)识别格式

df = pd.read_csv(file_name, parse_dates=[date_col_name],infer_datetime_format=True)

While loading a csv file that contains a date column, we have two approaches to make pandas recognize the date column, i.e.

  1. Pandas explicitly recognizes the format via the arg date_parser=mydateparser

  2. Pandas implicitly recognizes the format via the arg infer_datetime_format=True

Some of the date column data

01/01/18

01/02/18

Here we don't know whether the first two digits are the month or the day. So in this case we have to use Method 1: explicitly pass the format

    mydateparser = lambda x: pd.datetime.strptime(x, "%m/%d/%y")
    df = pd.read_csv(file_name, parse_dates=['date_col_name'],
                     date_parser=mydateparser)

Method 2: implicitly (automatically) recognize the format

df = pd.read_csv(file_name, parse_dates=[date_col_name],infer_datetime_format=True)

熊猫仅使用列名创建空的DataFrame

问题:熊猫仅使用列名创建空的DataFrame

我有一个动态的DataFrame,它工作正常,但是当没有数据要添加到DataFrame中时,出现错误。因此,我需要一个解决方案以仅使用列名创建一个空的DataFrame。

现在我有这样的事情:

df = pd.DataFrame(columns=COLUMN_NAMES) # Note that there are no row data inserted.

PS:重要的是,列名仍应出现在DataFrame中。

但是当我这样使用它时,我得到的结果是这样的:

Index([], dtype='object')
Empty DataFrame

“空DataFrame”部分很好!但是,除了索引之外,我还需要显示列。

编辑:

我发现的一件重要事情:我正在使用Jinja2将此DataFrame转换为PDF,因此我在调出一种方法,首先将其输出为HTML,如下所示:

df.to_html()

我认为这是专栏迷路的地方。

Edit2:通常,我遵循以下示例:http : //pbpython.com/pdf-reports.html。CSS也来自链接。这就是我将数据帧发送到PDF的过程:

env = Environment(loader=FileSystemLoader('.'))
template = env.get_template("pdf_report_template.html")
template_vars = {"my_dataframe": df.to_html()}

html_out = template.render(template_vars)
HTML(string=html_out).write_pdf("my_pdf.pdf", stylesheets=["pdf_report_style.css"])

编辑3:

如果在创建后立即打印出数据框,则会得到以下信息:

[0 rows x 9 columns]
Empty DataFrame
Columns: [column_a, column_b, column_c, column_d, 
column_e, column_f, column_g, 
column_h, column_i]
Index: []

这似乎是合理的,但是如果我打印出template_vars:

'my_dataframe': '<table border="1" class="dataframe">\n  <tbody>\n    <tr>\n      <td>Index([], dtype=\'object\')</td>\n      <td>Empty DataFrame</td>\n    </tr>\n  </tbody>\n</table>'

似乎这些列已经丢失了。

E4:如果我打印出以下内容:

print(df.to_html())

我已经得到以下结果:

<table border="1" class="dataframe">
  <tbody>
    <tr>
      <td>Index([], dtype='object')</td>
      <td>Empty DataFrame</td>
    </tr>
  </tbody>
</table>

I have a dynamic DataFrame which works fine, but when there are no data to be added into the DataFrame I get an error. And therefore I need a solution to create an empty DataFrame with only the column names.

For now I have something like this:

df = pd.DataFrame(columns=COLUMN_NAMES) # Note that there are no row data inserted.

PS: It is important that the column names would still appear in a DataFrame.

But when I use it like this I get something like that as a result:

Index([], dtype='object')
Empty DataFrame

The “Empty DataFrame” part is good! But instead of the Index thing I need to still display the columns.

Edit:

An important thing that I found out: I am converting this DataFrame to a PDF using Jinja2, so therefore I’m calling out a method to first output it to HTML like that:

df.to_html()

This is where the columns get lost I think.

Edit2: In general, I followed this example: http://pbpython.com/pdf-reports.html. The css is also from the link. That’s what I do to send the dataframe to the PDF:

env = Environment(loader=FileSystemLoader('.'))
template = env.get_template("pdf_report_template.html")
template_vars = {"my_dataframe": df.to_html()}

html_out = template.render(template_vars)
HTML(string=html_out).write_pdf("my_pdf.pdf", stylesheets=["pdf_report_style.css"])

Edit3:

If I print out the dataframe right after creation I get the following:

[0 rows x 9 columns]
Empty DataFrame
Columns: [column_a, column_b, column_c, column_d, 
column_e, column_f, column_g, 
column_h, column_i]
Index: []

That seems reasonable, but if I print out the template_vars:

'my_dataframe': '<table border="1" class="dataframe">\n  <tbody>\n    <tr>\n      <td>Index([], dtype=\'object\')</td>\n      <td>Empty DataFrame</td>\n    </tr>\n  </tbody>\n</table>'

And it seems that the columns are missing already.

E4: If I print out the following:

print(df.to_html())

I get the following result already:

<table border="1" class="dataframe">
  <tbody>
    <tr>
      <td>Index([], dtype='object')</td>
      <td>Empty DataFrame</td>
    </tr>
  </tbody>
</table>

回答 0

您可以使用列名称或索引创建一个空的DataFrame:

In [4]: import pandas as pd
In [5]: df = pd.DataFrame(columns=['A','B','C','D','E','F','G'])
In [6]: df
Out[6]:
Empty DataFrame
Columns: [A, B, C, D, E, F, G]
Index: []

要么

In [7]: df = pd.DataFrame(index=range(1,10))
In [8]: df
Out[8]:
Empty DataFrame
Columns: []
Index: [1, 2, 3, 4, 5, 6, 7, 8, 9]

编辑:即使加上您关于 .to_html 的补充说明,我也无法重现该问题。以下代码:

df = pd.DataFrame(columns=['A','B','C','D','E','F','G'])
df.to_html('test.html')

生成:

<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>A</th>
      <th>B</th>
      <th>C</th>
      <th>D</th>
      <th>E</th>
      <th>F</th>
      <th>G</th>
    </tr>
  </thead>
  <tbody>
  </tbody>
</table>

You can create an empty DataFrame with either column names or an Index:

In [4]: import pandas as pd
In [5]: df = pd.DataFrame(columns=['A','B','C','D','E','F','G'])
In [6]: df
Out[6]:
Empty DataFrame
Columns: [A, B, C, D, E, F, G]
Index: []

Or

In [7]: df = pd.DataFrame(index=range(1,10))
In [8]: df
Out[8]:
Empty DataFrame
Columns: []
Index: [1, 2, 3, 4, 5, 6, 7, 8, 9]

Edit: Even after your amendment with the .to_html, I can’t reproduce. This:

df = pd.DataFrame(columns=['A','B','C','D','E','F','G'])
df.to_html('test.html')

Produces:

<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>A</th>
      <th>B</th>
      <th>C</th>
      <th>D</th>
      <th>E</th>
      <th>F</th>
      <th>G</th>
    </tr>
  </thead>
  <tbody>
  </tbody>
</table>

回答 1

您是否正在寻找这样的东西?

    COLUMN_NAMES=['A','B','C','D','E','F','G']
    df = pd.DataFrame(columns=COLUMN_NAMES)
    df.columns

   Index(['A', 'B', 'C', 'D', 'E', 'F', 'G'], dtype='object')

Are you looking for something like this?

    COLUMN_NAMES=['A','B','C','D','E','F','G']
    df = pd.DataFrame(columns=COLUMN_NAMES)
    df.columns

   Index(['A', 'B', 'C', 'D', 'E', 'F', 'G'], dtype='object')

回答 2

df.to_html() 有一个columns参数。

只需将列传递到to_html()方法中即可。

df.to_html(columns=['A','B','C','D','E','F','G'])

df.to_html() has a columns parameter.

Just pass the columns into the to_html() method.

df.to_html(columns=['A','B','C','D','E','F','G'])

在熊猫中用NaN替换空白值(空白)

问题:在熊猫中用NaN替换空白值(空白)

我想找到 Pandas 数据框中所有包含空格(任意数量)的值,并用 NaN 替换这些值。

有什么想法可以改善吗?

基本上我想把这个:

                   A    B    C
2000-01-01 -0.532681  foo    0
2000-01-02  1.490752  bar    1
2000-01-03 -1.387326  foo    2
2000-01-04  0.814772  baz     
2000-01-05 -0.222552         4
2000-01-06 -1.176781  qux     

变成这个:

                   A     B     C
2000-01-01 -0.532681   foo     0
2000-01-02  1.490752   bar     1
2000-01-03 -1.387326   foo     2
2000-01-04  0.814772   baz   NaN
2000-01-05 -0.222552   NaN     4
2000-01-06 -1.176781   qux   NaN

我已经用下面的代码做到了,但是它很丑。这不符合 Python 风格,而且我敢肯定也不是最高效的 pandas 用法。我遍历每一列,用一个对每个值做正则搜索(匹配空白)的函数生成列掩码,然后据此进行布尔替换。

for i in df.columns:
    df[i][df[i].apply(lambda i: True if re.search('^\s*$', str(i)) else False)]=None

通过仅迭代可能包含空字符串的字段,可以对它进行一些优化:

if df[i].dtype == np.dtype('object')

但这并没有太大的改善

最后,此代码将目标字符串设置为 None,它可以和 fillna() 之类的 Pandas 函数配合使用;但为了完整起见,如果我能直接插入 NaN 而不是 None,那就更好了。

I want to find all values in a Pandas dataframe that contain whitespace (any arbitrary amount) and replace those values with NaNs.

Any ideas how this can be improved?

Basically I want to turn this:

                   A    B    C
2000-01-01 -0.532681  foo    0
2000-01-02  1.490752  bar    1
2000-01-03 -1.387326  foo    2
2000-01-04  0.814772  baz     
2000-01-05 -0.222552         4
2000-01-06 -1.176781  qux     

Into this:

                   A     B     C
2000-01-01 -0.532681   foo     0
2000-01-02  1.490752   bar     1
2000-01-03 -1.387326   foo     2
2000-01-04  0.814772   baz   NaN
2000-01-05 -0.222552   NaN     4
2000-01-06 -1.176781   qux   NaN

I’ve managed to do it with the code below, but man is it ugly. It’s not Pythonic and I’m sure it’s not the most efficient use of pandas either. I loop through each column and do boolean replacement against a column mask generated by applying a function that does a regex search of each value, matching on whitespace.

for i in df.columns:
    df[i][df[i].apply(lambda i: True if re.search('^\s*$', str(i)) else False)]=None

It could be optimized a bit by only iterating through fields that could contain empty strings:

if df[i].dtype == np.dtype('object')

But that’s not much of an improvement

And finally, this code sets the target strings to None, which works with Pandas’ functions like fillna(), but it would be nice for completeness if I could actually insert a NaN directly instead of None.


回答 0

我认为自 pandas 0.13 起,df.replace() 就可以做到这一点:

df = pd.DataFrame([
    [-0.532681, 'foo', 0],
    [1.490752, 'bar', 1],
    [-1.387326, 'foo', 2],
    [0.814772, 'baz', ' '],     
    [-0.222552, '   ', 4],
    [-1.176781,  'qux', '  '],         
], columns='A B C'.split(), index=pd.date_range('2000-01-01','2000-01-06'))

# replace field that's entirely space (or empty) with NaN
print(df.replace(r'^\s*$', np.nan, regex=True))

生成:

                   A    B   C
2000-01-01 -0.532681  foo   0
2000-01-02  1.490752  bar   1
2000-01-03 -1.387326  foo   2
2000-01-04  0.814772  baz NaN
2000-01-05 -0.222552  NaN   4
2000-01-06 -1.176781  qux NaN

正如 Temak 指出的那样,如果您的有效数据本身包含空格,请改用 df.replace(r'^\s+$', np.nan, regex=True)。

I think df.replace() does the job, since pandas 0.13:

df = pd.DataFrame([
    [-0.532681, 'foo', 0],
    [1.490752, 'bar', 1],
    [-1.387326, 'foo', 2],
    [0.814772, 'baz', ' '],     
    [-0.222552, '   ', 4],
    [-1.176781,  'qux', '  '],         
], columns='A B C'.split(), index=pd.date_range('2000-01-01','2000-01-06'))

# replace field that's entirely space (or empty) with NaN
print(df.replace(r'^\s*$', np.nan, regex=True))

Produces:

                   A    B   C
2000-01-01 -0.532681  foo   0
2000-01-02  1.490752  bar   1
2000-01-03 -1.387326  foo   2
2000-01-04  0.814772  baz NaN
2000-01-05 -0.222552  NaN   4
2000-01-06 -1.176781  qux NaN

As Temak pointed out, use df.replace(r'^\s+$', np.nan, regex=True) in case your valid data contains white spaces.
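
The practical difference between the two patterns, shown on a small made-up Series: r'^\s*$' also matches genuinely empty strings, while r'^\s+$' leaves them alone:

import numpy as np
import pandas as pd

s = pd.Series(['foo', '', '   ', 'bar baz'])

print(s.replace(r'^\s*$', np.nan, regex=True))  # '' and '   ' both become NaN
print(s.replace(r'^\s+$', np.nan, regex=True))  # only '   ' becomes NaN; '' is kept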


回答 1

如果既要替换空字符串,又要替换只包含空格的记录,那么正确答案是:

df = df.replace(r'^\s*$', np.nan, regex=True)

接受的答案

df.replace(r'\s+', np.nan, regex=True)

并不会替换空字符串!您可以用下面这个稍作修改的示例自行验证:

df = pd.DataFrame([
    [-0.532681, 'foo', 0],
    [1.490752, 'bar', 1],
    [-1.387326, 'fo o', 2],
    [0.814772, 'baz', ' '],     
    [-0.222552, '   ', 4],
    [-1.176781,  'qux', ''],         
], columns='A B C'.split(), index=pd.date_range('2000-01-01','2000-01-06'))

另请注意,尽管 'fo o' 包含空格,但它并没有被替换为 NaN。还要注意,简单的:

df.replace(r'', np.NaN)

也不起作用-试试看。

If you want to replace empty strings and records containing only spaces, the correct answer is:

df = df.replace(r'^\s*$', np.nan, regex=True)

The accepted answer

df.replace(r'\s+', np.nan, regex=True)

does not replace an empty string! You can try it yourself with the given example, slightly updated:

df = pd.DataFrame([
    [-0.532681, 'foo', 0],
    [1.490752, 'bar', 1],
    [-1.387326, 'fo o', 2],
    [0.814772, 'baz', ' '],     
    [-0.222552, '   ', 4],
    [-1.176781,  'qux', ''],         
], columns='A B C'.split(), index=pd.date_range('2000-01-01','2000-01-06'))

Note also that 'fo o' is not replaced with NaN, even though it contains a space. Further note that a simple:

df.replace(r'', np.NaN)

Does not work either – try it out.


回答 2

怎么样:

d = d.applymap(lambda x: np.nan if isinstance(x, basestring) and x.isspace() else x)

applymap函数将一个函数应用于数据帧的每个单元。

How about:

d = d.applymap(lambda x: np.nan if isinstance(x, basestring) and x.isspace() else x)

The applymap function applies a function to every cell of the dataframe.


回答 3

我将这样做:

df = df.apply(lambda x: x.str.strip()).replace('', np.nan)

要么

df = df.apply(lambda x: x.str.strip() if x.dtype == object else x).replace('', np.nan)

您可以剥离所有str,然后将空str替换为np.nan

I would do this:

df = df.apply(lambda x: x.str.strip()).replace('', np.nan)

or

df = df.apply(lambda x: x.str.strip() if x.dtype == object else x).replace('', np.nan)

You can strip all str, then replace empty str with np.nan.
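
If the frame mixes string and numeric columns, a hedged variant of the same idea (assuming the string columns are the object-dtype ones) strips only those columns, so .str does not fail on the numeric ones:

import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [-0.22, 0.81], 'B': ['  ', 'baz ']})

# Strip only the object-dtype columns, then turn empty strings into NaN.
obj_cols = df.select_dtypes(include='object').columns
df[obj_cols] = df[obj_cols].apply(lambda s: s.str.strip())
df = df.replace('', np.nan)
print(df)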


回答 4

所有解决方案中最简单的:

df = df.replace(r'^\s+$', np.nan, regex=True)

Simplest of all solutions:

df = df.replace(r'^\s+$', np.nan, regex=True)

回答 5

如果要从CSV文件导出数据,则可以像这样简单:

df = pd.read_csv(file_csv, na_values=' ')

这将创建数据框,并将空白值替换为 NaN。

If you are exporting the data from the CSV file it can be as simple as this :

df = pd.read_csv(file_csv, na_values=' ')

This will create the data frame as well as replace blank values with NaN.


回答 6

如果只需要检查与单个值是否相等,一个非常快速、简单的解决方案是使用 mask 方法。

df.mask(df == ' ')

For a very fast and simple solution where you check equality against a single value, you can use the mask method.

df.mask(df == ' ')
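
Keep in mind this is an exact, whole-cell comparison; a quick sketch with made-up data showing that a cell holding two spaces is not caught by df == ' ':

import pandas as pd

df = pd.DataFrame({'B': ['foo', ' ', '  ']})

# Only the cell that is exactly one space is masked; '  ' (two spaces) survives.
print(df.mask(df == ' '))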

回答 7

这些方法都接近正确答案,但我不会说其中任何一个既解决了问题,又能让阅读你代码的其他人最容易理解。我认为答案应当是 BrenBarn 的回答与 tuomasttik 在该回答下的评论的结合。BrenBarn 的回答利用了内置的 isspace,但不支持按 OP 的要求删除空字符串,而我倾向于把删除空字符串归为“用空值替换字符串”的标准用例。

我用 .apply 重写了它,因此可以在 pd.Series 或 pd.DataFrame 上调用它。


Python 3:

替换空字符串或整个空格的字符串:

df = df.apply(lambda x: np.nan if isinstance(x, str) and (x.isspace() or not x) else x)

要替换整个空格字符串:

df = df.apply(lambda x: np.nan if isinstance(x, str) and x.isspace() else x)

要在 Python 2 中使用此代码,您需要将 str 替换为 basestring。

Python 2:

替换空字符串或整个空格的字符串:

df = df.apply(lambda x: np.nan if isinstance(x, basestring) and (x.isspace() or not x) else x)

要替换整个空格字符串:

df = df.apply(lambda x: np.nan if isinstance(x, basestring) and x.isspace() else x)

These are all close to the right answer, but I wouldn’t say any solve the problem while remaining most readable to others reading your code. I’d say that answer is a combination of BrenBarn’s Answer and tuomasttik’s comment below that answer. BrenBarn’s answer utilizes isspace builtin, but does not support removing empty strings, as OP requested, and I would tend to attribute that as the standard use case of replacing strings with null.

I rewrote it with .apply, so you can call it on a pd.Series or pd.DataFrame.


Python 3:

To replace empty strings or strings of entirely spaces:

df = df.apply(lambda x: np.nan if isinstance(x, str) and (x.isspace() or not x) else x)

To replace strings of entirely spaces:

df = df.apply(lambda x: np.nan if isinstance(x, str) and x.isspace() else x)

To use this in Python 2, you’ll need to replace str with basestring.

Python 2:

To replace empty strings or strings of entirely spaces:

df = df.apply(lambda x: np.nan if isinstance(x, basestring) and (x.isspace() or not x) else x)

To replace strings of entirely spaces:

df = df.apply(lambda x: np.nan if isinstance(x, basestring) and x.isspace() else x)

回答 8

这对我有用。导入 csv 文件时,我添加了 na_values=' '。默认的 NaN 值中不包含空格。

df = pd.read_csv(filepath, na_values=' ')

This worked for me. When I import my csv file I added na_values=' '. Spaces are not included in the default NaN values.

df = pd.read_csv(filepath, na_values=' ')
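
na_values also accepts a list, so several blank variants can be declared at once; a minimal sketch with an inline CSV invented for illustration:

import io
import pandas as pd

data = io.StringIO("A,B\n1, \n2,  \n3,foo\n")

# Treat one space, two spaces and the empty string as missing while reading.
df = pd.read_csv(data, na_values=[' ', '  ', ''])
print(df)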


回答 9

您还可以使用过滤器来执行此操作。

df = pd.DataFrame([
    [-0.532681, 'foo', 0],
    [1.490752, 'bar', 1],
    [-1.387326, 'foo', 2],
    [0.814772, 'baz', ' '],
    [-0.222552, '   ', 4],
    [-1.176781, 'qux', '  '],
])
df[df == ''] = 'nan'
df = df.astype(float)

you can also use a filter to do it.

df = pd.DataFrame([
    [-0.532681, 'foo', 0],
    [1.490752, 'bar', 1],
    [-1.387326, 'foo', 2],
    [0.814772, 'baz', ' '],
    [-0.222552, '   ', 4],
    [-1.176781, 'qux', '  '],
])
df[df == ''] = 'nan'
df = df.astype(float)

回答 10

print(df.isnull().sum()) # check numbers of null value in each column

modifiedDf=df.fillna("NaN") # Replace empty/null values with "NaN"

# modifiedDf = df.dropna() # Remove rows with empty values

print(modifiedDf.isnull().sum()) # check numbers of null value in each column
print(df.isnull().sum()) # check numbers of null value in each column

modifiedDf=df.fillna("NaN") # Replace empty/null values with "NaN"

# modifiedDf = df.dropna() # Remove rows with empty values

print(modifiedDf.isnull().sum()) # check numbers of null value in each column

回答 11

这不是一个很好的解决方案,但是似乎有效的方法是保存到XLSX,然后将其重新导入。不确定为什么,此页面上的其他解决方案对我不起作用。

data.to_excel(filepath, index=False)
data = pd.read_excel(filepath)

This is not an elegant solution, but what does seem to work is saving to XLSX and then importing it back. The other solutions on this page did not work for me, unsure why.

data.to_excel(filepath, index=False)
data = pd.read_excel(filepath)