Tag archive: pandas

Conditional Replace Pandas

Question: Conditional Replace Pandas

I have a DataFrame, and I want to replace the values in a particular column that exceed a value with zero. I had thought this was a way of achieving this:

df[df.my_channel > 20000].my_channel = 0

If I copy the channel into a new data frame it’s simple:

df2 = df.my_channel 

df2[df2 > 20000] = 0

This does exactly what I want, but seems not to work with the channel as part of the original DataFrame.


Answer 0

The .ix indexer works okay for pandas versions prior to 0.20.0, but since pandas 0.20.0 the .ix indexer is deprecated, so you should avoid using it. Instead, you can use the .loc or .iloc indexers. You can solve this problem with:

mask = df.my_channel > 20000
column_name = 'my_channel'
df.loc[mask, column_name] = 0

Or, in one line,

df.loc[df.my_channel > 20000, 'my_channel'] = 0

mask helps you to select the rows in which df.my_channel > 20000 is True, while df.loc[mask, column_name] = 0 sets the value 0 for the selected rows, where mask holds, in the column whose name is column_name.

Update: In this case, you should use loc because if you use iloc, you will get a NotImplementedError telling you that iLocation based boolean indexing on an integer type is not available.
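
If you do need a positional route, one workaround (a sketch, not part of the original answer) is to convert the boolean mask into integer positions first, which .iloc does accept:

import numpy as np

row_positions = np.flatnonzero(df.my_channel > 20000)  # integer row positions where the mask is True
col_position = df.columns.get_loc('my_channel')        # integer position of the column
df.iloc[row_positions, col_position] = 0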


Answer 1

Try

df.loc[df.my_channel > 20000, 'my_channel'] = 0

Note: Since v0.20.0, ix has been deprecated in favour of loc / iloc.


Answer 2

The np.where function works as follows:

df['X'] = np.where(df['Y']>=50, 'yes', 'no')

In your case you would want:

import numpy as np
df['my_channel'] = np.where(df.my_channel > 20000, 0, df.my_channel)

Answer 3

The reason your original dataframe does not update is because chained indexing may cause you to modify a copy rather than a view of your dataframe. The docs give this advice:

When setting values in a pandas object, care must be taken to avoid what is called chained indexing.
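
As a minimal illustration of the failure mode (a sketch, using a small frame in the spirit of the question):

import pandas as pd

df = pd.DataFrame({'my_channel': [100, 30000, 500]})

# Chained indexing: df[...] may return a copy, so the assignment does not
# propagate back to df and pandas emits a SettingWithCopyWarning.
df[df.my_channel > 20000].my_channel = 0
print(df.my_channel.tolist())  # [100, 30000, 500], unchanged

# A single .loc call sets the values on df itself.
df.loc[df.my_channel > 20000, 'my_channel'] = 0
print(df.my_channel.tolist())  # [100, 0, 500]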

You have a few alternatives:

loc + Boolean indexing

loc may be used for setting values and supports Boolean masks:

df.loc[df['my_channel'] > 20000, 'my_channel'] = 0

mask + Boolean indexing

You can assign to your series:

df['my_channel'] = df['my_channel'].mask(df['my_channel'] > 20000, 0)

Or you can update your series in place:

df['my_channel'].mask(df['my_channel'] > 20000, 0, inplace=True)

np.where + Boolean indexing

You can use NumPy by assigning your original series when your condition is not satisfied; however, the first two solutions are cleaner since they explicitly change only specified values.

df['my_channel'] = np.where(df['my_channel'] > 20000, 0, df['my_channel'])

Answer 4

I would use a lambda function on a Series of a DataFrame like this:

f = lambda x: 0 if x>100 else 1
df['my_column'] = df['my_column'].map(f)

I do not assert that this is an efficient way, but it works fine.


Answer 5

Try this:

df.my_channel = df.my_channel.where(df.my_channel <= 20000, other= 0)

or

df.my_channel = df.my_channel.mask(df.my_channel > 20000, other= 0)
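
(For context: Series.where keeps the values where the condition is True and replaces the rest with other, while Series.mask does the inverse, replacing the values where the condition is True; that is why the two lines above use opposite comparisons.)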


pandas groupby: How to get a union of strings

Question: pandas groupby: How to get a union of strings

I have a dataframe like this:

   A         B       C
0  1  0.749065    This
1  2  0.301084      is
2  3  0.463468       a
3  4  0.643961  random
4  1  0.866521  string
5  2  0.120737       !

Calling

In [10]: print df.groupby("A")["B"].sum()

will return

A
1    1.615586
2    0.421821
3    0.463468
4    0.643961

Now I would like to do “the same” for column “C”. Because that column contains strings, sum() doesn’t work (although you might think that it would concatenate the strings). What I would really like to see is a list or set of the strings for each group, i.e.

A
1    {This, string}
2    {is, !}
3    {a}
4    {random}

I have been trying to find ways to do this.

Series.unique() (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.unique.html) doesn’t work, although

df.groupby("A")["B"]

is a

pandas.core.groupby.SeriesGroupBy object

so I was hoping any Series method would work. Any ideas?


Answer 0

In [4]: df = read_csv(StringIO(data),sep='\s+')

In [5]: df
Out[5]: 
   A         B       C
0  1  0.749065    This
1  2  0.301084      is
2  3  0.463468       a
3  4  0.643961  random
4  1  0.866521  string
5  2  0.120737       !

In [6]: df.dtypes
Out[6]: 
A      int64
B    float64
C     object
dtype: object

When you apply your own function, there is no automatic exclusion of non-numeric columns. This is slower, though, than applying .sum() to the groupby:

In [8]: df.groupby('A').apply(lambda x: x.sum())
Out[8]: 
   A         B           C
A                         
1  2  1.615586  Thisstring
2  4  0.421821         is!
3  3  0.463468           a
4  4  0.643961      random

sum by default concatenates

In [9]: df.groupby('A')['C'].apply(lambda x: x.sum())
Out[9]: 
A
1    Thisstring
2           is!
3             a
4        random
dtype: object

You can do pretty much what you want

In [11]: df.groupby('A')['C'].apply(lambda x: "{%s}" % ', '.join(x))
Out[11]: 
A
1    {This, string}
2           {is, !}
3               {a}
4          {random}
dtype: object

You can also do this on a whole frame, one group at a time. The key is to return a Series:

def f(x):
     return Series(dict(A = x['A'].sum(), 
                        B = x['B'].sum(), 
                        C = "{%s}" % ', '.join(x['C'])))

In [14]: df.groupby('A').apply(f)
Out[14]: 
   A         B               C
A                             
1  2  1.615586  {This, string}
2  4  0.421821         {is, !}
3  3  0.463468             {a}
4  4  0.643961        {random}

Answer 1

You can use the apply method to apply an arbitrary function to the grouped data. So if you want a set, apply set. If you want a list, apply list.

>>> d
   A       B
0  1    This
1  2      is
2  3       a
3  4  random
4  1  string
5  2       !
>>> d.groupby('A')['B'].apply(list)
A
1    [This, string]
2           [is, !]
3               [a]
4          [random]
dtype: object

If you want something else, just write a function that does what you want and then apply that.
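
If you want the sets shown in the question, apply set in the same way (a sketch; note that the element order inside each printed set is arbitrary):

>>> d.groupby('A')['B'].apply(set)

Each group then maps to a Python set of its strings, e.g. {'This', 'string'} for group 1.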


Answer 2

You may be able to use the aggregate (or agg) function to concatenate the values. (Untested code)

df.groupby('A')['B'].agg(lambda col: ''.join(col))
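
As a quick check against the question's frame, the same idea does work on the string column C, and a separator keeps the words readable (a sketch):

df.groupby('A')['C'].agg(', '.join)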

Answer 3

You could try this:

df.groupby('A').agg({'B':'sum','C':'-'.join})

Answer 4

A simple solution would be:

>>> df.groupby(['A','B']).C.unique().reset_index()

Answer 5

Named aggregations with pandas >= 0.25.0

Since pandas version 0.25.0 we have named aggregations where we can groupby, aggregate and at the same time assign new names to our columns. This way we won’t get the MultiIndex columns, and the column names make more sense given the data they contain:


aggregate and get a list of strings

grp = df.groupby('A').agg(B_sum=('B','sum'),
                          C=('C', list)).reset_index()

print(grp)
   A     B_sum               C
0  1  1.615586  [This, string]
1  2  0.421821         [is, !]
2  3  0.463468             [a]
3  4  0.643961        [random]

aggregate and join the strings

grp = df.groupby('A').agg(B_sum=('B','sum'),
                          C=('C', ', '.join)).reset_index()

print(grp)
   A     B_sum             C
0  1  1.615586  This, string
1  2  0.421821         is, !
2  3  0.463468             a
3  4  0.643961        random

Answer 6

If you’d like to overwrite column B in the dataframe, this should work:

    df = df.groupby('A',as_index=False).agg(lambda x:'\n'.join(x))

Answer 7

Following @Erfan’s good answer: most of the time, in an analysis of aggregated values you want the unique possible combinations of these existing character values:

unique_chars = lambda x: ', '.join(x.unique())
(df
 .groupby(['A'])
 .agg({'C': unique_chars}))

Find the column name which has the maximum value for each row

Question: Find the column name which has the maximum value for each row

I have a DataFrame like this one:

In [7]:
frame.head()
Out[7]:
Communications and Search   Business    General Lifestyle
0   0.745763    0.050847    0.118644    0.084746
0   0.333333    0.000000    0.583333    0.083333
0   0.617021    0.042553    0.297872    0.042553
0   0.435897    0.000000    0.410256    0.153846
0   0.358974    0.076923    0.410256    0.153846

In here, I want to ask how to get column name which has maximum value for each row, the desired output is like this:

In [7]:
    frame.head()
    Out[7]:
    Communications and Search   Business    General Lifestyle   Max
    0   0.745763    0.050847    0.118644    0.084746           Communications 
    0   0.333333    0.000000    0.583333    0.083333           Business  
    0   0.617021    0.042553    0.297872    0.042553           Communications 
    0   0.435897    0.000000    0.410256    0.153846           Communications 
    0   0.358974    0.076923    0.410256    0.153846           Business 

Answer 0

You can use idxmax with axis=1 to find the column with the greatest value on each row:

>>> df.idxmax(axis=1)
0    Communications
1          Business
2    Communications
3    Communications
4          Business
dtype: object

To create the new column ‘Max’, use df['Max'] = df.idxmax(axis=1).

To find the row index at which the maximum value occurs in each column, use df.idxmax() (or equivalently df.idxmax(axis=0)).


Answer 1

If you want to produce a column containing the name of the column with the maximum value, but considering only a subset of columns, then you can use a variation of @ajcr’s answer:

df['Max'] = df[['Communications','Business']].idxmax(axis=1)

Answer 2

You could apply on dataframe and get argmax() of each row via axis=1

In [144]: df.apply(lambda x: x.argmax(), axis=1)
Out[144]:
0    Communications
1          Business
2    Communications
3    Communications
4          Business
dtype: object

Here’s a benchmark comparing how slow the apply method is relative to idxmax() for len(df) ~ 20K:

In [146]: %timeit df.apply(lambda x: x.argmax(), axis=1)
1 loops, best of 3: 479 ms per loop

In [147]: %timeit df.idxmax(axis=1)
10 loops, best of 3: 47.3 ms per loop

How to add multiple columns to a pandas DataFrame in one assignment?

Question: How to add multiple columns to a pandas DataFrame in one assignment?

I’m new to pandas and trying to figure out how to add multiple columns to pandas simultaneously. Any help here is appreciated. Ideally I would like to do this in one step rather than multiple repeated steps…

import pandas as pd

df = {'col_1': [0, 1, 2, 3],
        'col_2': [4, 5, 6, 7]}
df = pd.DataFrame(df)

df[[ 'column_new_1', 'column_new_2','column_new_3']] = [np.nan, 'dogs',3]  #thought this would work here...

Answer 0

I would have expected your syntax to work too. The problem arises because when you create new columns with the column-list syntax (df[[new1, new2]] = ...), pandas requires that the right hand side be a DataFrame (note that it doesn’t actually matter if the columns of the DataFrame have the same names as the columns you are creating).

Your syntax works fine for assigning scalar values to existing columns, and pandas is also happy to assign scalar values to a new column using the single-column syntax (df[new1] = ...). So the solution is either to convert this into several single-column assignments, or create a suitable DataFrame for the right-hand side.

Here are several approaches that will work:

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'col_1': [0, 1, 2, 3],
    'col_2': [4, 5, 6, 7]
})

Then one of the following:

1) Three assignments in one, using list unpacking:

df['column_new_1'], df['column_new_2'], df['column_new_3'] = [np.nan, 'dogs', 3]

2) DataFrame conveniently expands a single row to match the index, so you can do this:

df[['column_new_1', 'column_new_2', 'column_new_3']] = pd.DataFrame([[np.nan, 'dogs', 3]], index=df.index)

3) Make a temporary data frame with new columns, then combine with the original data frame later:

df = pd.concat(
    [
        df,
        pd.DataFrame(
            [[np.nan, 'dogs', 3]], 
            index=df.index, 
            columns=['column_new_1', 'column_new_2', 'column_new_3']
        )
    ], axis=1
)

4) Similar to the previous, but using join instead of concat (may be less efficient):

df = df.join(pd.DataFrame(
    [[np.nan, 'dogs', 3]], 
    index=df.index, 
    columns=['column_new_1', 'column_new_2', 'column_new_3']
))

5) Using a dict is a more “natural” way to create the new data frame than the previous two, but the new columns will be sorted alphabetically (at least before Python 3.6 or 3.7):

df = df.join(pd.DataFrame(
    {
        'column_new_1': np.nan,
        'column_new_2': 'dogs',
        'column_new_3': 3
    }, index=df.index
))

6) Use .assign() with multiple column arguments.

I like this variant on @zero’s answer a lot, but like the previous one, the new columns will always be sorted alphabetically, at least with early versions of Python:

df = df.assign(column_new_1=np.nan, column_new_2='dogs', column_new_3=3)

7) This is interesting (based on https://stackoverflow.com/a/44951376/3830997), but I don’t know when it would be worth the trouble:

new_cols = ['column_new_1', 'column_new_2', 'column_new_3']
new_vals = [np.nan, 'dogs', 3]
df = df.reindex(columns=df.columns.tolist() + new_cols)   # add empty cols
df[new_cols] = new_vals  # multi-column assignment works for existing cols

8) In the end it’s hard to beat three separate assignments:

df['column_new_1'] = np.nan
df['column_new_2'] = 'dogs'
df['column_new_3'] = 3

Note: many of these options have already been covered in other answers: Add multiple columns to DataFrame and set them equal to an existing column, Is it possible to add several columns at once to a pandas DataFrame?, Add multiple empty columns to pandas DataFrame


Answer 1

You could use assign with a dict of column names and values.

In [1069]: df.assign(**{'col_new_1': np.nan, 'col2_new_2': 'dogs', 'col3_new_3': 3})
Out[1069]:
   col_1  col_2 col2_new_2  col3_new_3  col_new_1
0      0      4       dogs           3        NaN
1      1      5       dogs           3        NaN
2      2      6       dogs           3        NaN
3      3      7       dogs           3        NaN

Answer 2

With the use of concat:

In [128]: df
Out[128]: 
   col_1  col_2
0      0      4
1      1      5
2      2      6
3      3      7

In [129]: pd.concat([df, pd.DataFrame(columns = [ 'column_new_1', 'column_new_2','column_new_3'])])
Out[129]: 
   col_1  col_2 column_new_1 column_new_2 column_new_3
0    0.0    4.0          NaN          NaN          NaN
1    1.0    5.0          NaN          NaN          NaN
2    2.0    6.0          NaN          NaN          NaN
3    3.0    7.0          NaN          NaN          NaN

Not very sure what you wanted to do with [np.nan, 'dogs', 3]. Maybe you now want to set them as default values?

In [142]: df1 = pd.concat([df, pd.DataFrame(columns = [ 'column_new_1', 'column_new_2','column_new_3'])])
In [143]: df1[[ 'column_new_1', 'column_new_2','column_new_3']] = [np.nan, 'dogs', 3]

In [144]: df1
Out[144]: 
   col_1  col_2  column_new_1 column_new_2  column_new_3
0    0.0    4.0           NaN         dogs             3
1    1.0    5.0           NaN         dogs             3
2    2.0    6.0           NaN         dogs             3
3    3.0    7.0           NaN         dogs             3

Answer 3

Use of a list comprehension, pd.DataFrame and pd.concat:

pd.concat(
    [
        df,
        pd.DataFrame(
            [[np.nan, 'dogs', 3] for _ in range(df.shape[0])],
            df.index, ['column_new_1', 'column_new_2','column_new_3']
        )
    ], axis=1)


Answer 4

If adding a lot of missing columns (a, b, c, …) with the same value, here 0, I did this:

    new_cols = ["a", "b", "c" ] 
    df[new_cols] = pd.DataFrame([[0] * len(new_cols)], index=df.index)

It’s based on the second variant of the accepted answer.


Answer 5

Just want to point out that option 2 in @Matthias Fripp’s answer

(2) I wouldn’t necessarily expect DataFrame to work this way, but it does

df[['column_new_1', 'column_new_2', 'column_new_3']] = pd.DataFrame([[np.nan, 'dogs', 3]], index=df.index)

is already documented in pandas’ own documentation http://pandas.pydata.org/pandas-docs/stable/indexing.html#basics

You can pass a list of columns to [] to select columns in that order. If a column is not contained in the DataFrame, an exception will be raised. Multiple columns can also be set in this manner. You may find this useful for applying a transform (in-place) to a subset of the columns.


Answer 6

If you just want to add empty new columns, reindex will do the job

df
   col_1  col_2
0      0      4
1      1      5
2      2      6
3      3      7

df.reindex(list(df)+['column_new_1', 'column_new_2','column_new_3'], axis=1)
   col_1  col_2  column_new_1  column_new_2  column_new_3
0      0      4           NaN           NaN           NaN
1      1      5           NaN           NaN           NaN
2      2      6           NaN           NaN           NaN
3      3      7           NaN           NaN           NaN

full code example

import numpy as np
import pandas as pd

df = {'col_1': [0, 1, 2, 3],
        'col_2': [4, 5, 6, 7]}
df = pd.DataFrame(df)
print('df',df, sep='\n')
print()
df=df.reindex(list(df)+['column_new_1', 'column_new_2','column_new_3'], axis=1)
print('''df.reindex(list(df)+['column_new_1', 'column_new_2','column_new_3'], axis=1)''',df, sep='\n')

Otherwise, go for @zero’s answer using assign.


Answer 7

I am not comfortable using “Index” and so on… so I came up with the below:

df.columns
Index(['A123', 'B123'], dtype='object')

df=pd.concat([df,pd.DataFrame(columns=list('CDE'))])

df.rename(columns={
    'C':'C123',
    'D':'D123',
    'E':'E123'
},inplace=True)


df.columns
Index(['A123', 'B123', 'C123', 'D123', 'E123'], dtype='object')

Get the index of a row in a pandas apply function

Question: Get the index of a row in a pandas apply function

I am trying to access the index of a row in a function applied across an entire DataFrame in Pandas. I have something like this:

df = pandas.DataFrame([[1,2,3],[4,5,6]], columns=['a','b','c'])
>>> df
   a  b  c
0  1  2  3
1  4  5  6

and I’ll define a function that access elements with a given row

def rowFunc(row):
    return row['a'] + row['b'] * row['c']

I can apply it like so:

df['d'] = df.apply(rowFunc, axis=1)
>>> df
   a  b  c   d
0  1  2  3   7
1  4  5  6  34

Awesome! Now what if I want to incorporate the index into my function? The index of any given row in this DataFrame before adding d would be Index([u'a', u'b', u'c', u'd'], dtype='object'), but I want the 0 and 1. So I can’t just access row.index.

I know I could create a temporary column in the table where I store the index, but I’m wondering if it is stored in the row object somewhere.


Answer 0

To access the index in this case you access the name attribute:

In [182]:

df = pd.DataFrame([[1,2,3],[4,5,6]], columns=['a','b','c'])
def rowFunc(row):
    return row['a'] + row['b'] * row['c']

def rowIndex(row):
    return row.name
df['d'] = df.apply(rowFunc, axis=1)
df['rowIndex'] = df.apply(rowIndex, axis=1)
df
Out[182]:
   a  b  c   d  rowIndex
0  1  2  3   7         0
1  4  5  6  34         1

Note that if this is really what you are trying to do, the following works and is much faster:

In [198]:

df['d'] = df['a'] + df['b'] * df['c']
df
Out[198]:
   a  b  c   d
0  1  2  3   7
1  4  5  6  34

In [199]:

%timeit df['a'] + df['b'] * df['c']
%timeit df.apply(rowIndex, axis=1)
10000 loops, best of 3: 163 µs per loop
1000 loops, best of 3: 286 µs per loop

EDIT

Looking at this question 3+ years later, you could just do:

In[15]:
df['d'],df['rowIndex'] = df['a'] + df['b'] * df['c'], df.index
df

Out[15]: 
   a  b  c   d  rowIndex
0  1  2  3   7         0
1  4  5  6  34         1

but assuming it isn’t as trivial as this, whatever your rowFunc is really doing, you should look to use the vectorised functions, and then use them against the df index:

In[16]:
df['newCol'] = df['a'] + df['b'] + df['c'] + df.index
df

Out[16]: 
   a  b  c   d  rowIndex  newCol
0  1  2  3   7         0       6
1  4  5  6  34         1      16

Answer 1

Either:

1. with row.name inside the apply(..., axis=1) call:

df = pandas.DataFrame([[1,2,3],[4,5,6]], columns=['a','b','c'], index=['x','y'])

   a  b  c
x  1  2  3
y  4  5  6

df.apply(lambda row: row.name, axis=1)

x    x
y    y

2. with iterrows() (slower)

DataFrame.iterrows() allows you to iterate over rows, and access their index:

for idx, row in df.iterrows():
    ...
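
For instance, a sketch that rebuilds the question's d column while having the row index in hand:

for idx, row in df.iterrows():
    df.loc[idx, 'd'] = row['a'] + row['b'] * row['c']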

Answer 2

To answer the original question: yes, you can access the index value of a row in apply(). It is available under the key name and requires that you specify axis=1 (because the lambda processes the columns of a row and not the rows of a column).

Working example (pandas 0.23.4):

>>> import pandas as pd
>>> df = pd.DataFrame([[1,2,3],[4,5,6]], columns=['a','b','c'])
>>> df.set_index('a', inplace=True)
>>> df
   b  c
a      
1  2  3
4  5  6
>>> df['index_x10'] = df.apply(lambda row: 10*row.name, axis=1)
>>> df
   b  c  index_x10
a                 
1  2  3         10
4  5  6         40

How to find numeric columns in Pandas?

Question: How to find numeric columns in Pandas?

Let’s say df is a pandas DataFrame. I would like to find all columns of numeric type. Something like:

isNumeric = is_numeric(df)

Answer 0

You could use the select_dtypes method of DataFrame. It takes two parameters, include and exclude. So isNumeric would look like:

numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']

newdf = df.select_dtypes(include=numerics)

Answer 1

You can use the undocumented function _get_numeric_data() to filter only numeric columns:

df._get_numeric_data()

Example:

In [32]: data
Out[32]:
   A  B
0  1  s
1  2  s
2  3  s
3  4  s

In [33]: data._get_numeric_data()
Out[33]:
   A
0  1
1  2
2  3
3  4

Note that this is a “private method” (i.e., an implementation detail) and is subject to change or total removal in the future. Use with caution.


Answer 2

Simple one-line answer to create a new dataframe with only numeric columns:

df.select_dtypes(include=np.number)

If you want the names of numeric columns:

df.select_dtypes(include=np.number).columns.tolist()

Complete code:

import pandas as pd
import numpy as np

df = pd.DataFrame({'A': range(7, 10),
                   'B': np.random.rand(3),
                   'C': ['foo','bar','baz'],
                   'D': ['who','what','when']})
df
#    A         B    C     D
# 0  7  0.704021  foo   who
# 1  8  0.264025  bar  what
# 2  9  0.230671  baz  when

df_numerics_only = df.select_dtypes(include=np.number)
df_numerics_only
#    A         B
# 0  7  0.704021
# 1  8  0.264025
# 2  9  0.230671

colnames_numerics_only = df.select_dtypes(include=np.number).columns.tolist()
colnames_numerics_only
# ['A', 'B']

Answer 3

df.select_dtypes(exclude=['object'])

Update

df.select_dtypes(include=np.number)
# or, with newer versions of pandas
df.select_dtypes('number')

Answer 4

Simple one-liner:

df.select_dtypes('number').columns

Answer 5

The following code will return a list of the names of the numeric columns of a data set.

cnames=list(marketing_train.select_dtypes(exclude=['object']).columns)

Here marketing_train is my data set, select_dtypes() selects columns by data type using its exclude and include arguments, and columns fetches the column names. The output of the above code will be the following:

['custAge',
     'campaign',
     'pdays',
     'previous',
     'emp.var.rate',
     'cons.price.idx',
     'cons.conf.idx',
     'euribor3m',
     'nr.employed',
     'pmonths',
     'pastEmail']

Thanks


Answer 6

This is another simple way of finding the numeric columns in a pandas data frame:

numeric_clmns = df.dtypes[df.dtypes != "object"].index 

Answer 7

def is_type(df, baseType):
    import numpy as np
    import pandas as pd
    test = [issubclass(np.dtype(d).type, baseType) for d in df.dtypes]
    return pd.DataFrame(data = test, index = df.columns, columns = ["test"])
def is_float(df):
    import numpy as np
    return is_type(df, np.float)
def is_number(df):
    import numpy as np
    return is_type(df, np.number)
def is_integer(df):
    import numpy as np
    return is_type(df, np.integer)

Answer 8

Adapting this answer, you could do

df.ix[:,df.applymap(np.isreal).all(axis=0)]

Here, df.applymap(np.isreal) shows whether every cell in the data frame is numeric, and .all(axis=0) checks whether all values in a column are True, returning a series of Booleans that can be used to index the desired columns.
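
Since .ix is deprecated (as noted earlier on this page), the same idea written with .loc would be (a sketch):

df.loc[:, df.applymap(np.isreal).all(axis=0)]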


Answer 9

Please see the below code:

if(dataset.select_dtypes(include=[np.number]).shape[1] > 0):
    display(dataset.select_dtypes(include=[np.number]).describe())
if(dataset.select_dtypes(include=[np.object]).shape[1] > 0):
    display(dataset.select_dtypes(include=[np.object]).describe())

This way you can check whether the values are numeric, such as float and int, or string values. The second if statement checks for string values, which fall under the object dtype.


Answer 10

We can include and exclude data types as per the requirement as below:

train.select_dtypes(include=None, exclude=None)
train.select_dtypes(include='number') #will include all the numeric types

Referred from Jupyter Notebook.

To select all numeric types, use np.number or 'number'

  • To select strings you must use the object dtype but note that this will return all object dtype columns

  • See the NumPy dtype hierarchy: http://docs.scipy.org/doc/numpy/reference/arrays.scalars.html

  • To select datetimes, use np.datetime64, 'datetime' or 'datetime64'

  • To select timedeltas, use np.timedelta64, 'timedelta' or 'timedelta64'

  • To select Pandas categorical dtypes, use 'category'

  • To select Pandas datetimetz dtypes, use 'datetimetz' (new in 0.20.0) or 'datetime64[ns, tz]'


How to write to an existing Excel file without overwriting data (using pandas)?

Question: How to write to an existing Excel file without overwriting data (using pandas)?

I use pandas to write to an Excel file in the following fashion:

import pandas

writer = pandas.ExcelWriter('Masterfile.xlsx') 

data_filtered.to_excel(writer, "Main", cols=['Diff1', 'Diff2'])

writer.save()

Masterfile.xlsx already consists of a number of different tabs; however, it does not yet contain “Main”.

Pandas correctly writes to the “Main” sheet; unfortunately, it also deletes all the other tabs.


Answer 0

The pandas docs say it uses openpyxl for xlsx files. A quick look through the code in ExcelWriter gives a clue that something like this might work out:

import pandas
from openpyxl import load_workbook

book = load_workbook('Masterfile.xlsx')
writer = pandas.ExcelWriter('Masterfile.xlsx', engine='openpyxl') 
writer.book = book

## ExcelWriter for some reason uses writer.sheets to access the sheet.
## If you leave it empty it will not know that sheet Main is already there
## and will create a new sheet.

writer.sheets = dict((ws.title, ws) for ws in book.worksheets)

data_filtered.to_excel(writer, "Main", cols=['Diff1', 'Diff2'])

writer.save()

Answer 1

Here is a helper function:

def append_df_to_excel(filename, df, sheet_name='Sheet1', startrow=None,
                       truncate_sheet=False, 
                       **to_excel_kwargs):
    """
    Append a DataFrame [df] to existing Excel file [filename]
    into [sheet_name] Sheet.
    If [filename] doesn't exist, then this function will create it.

    Parameters:
      filename : File path or existing ExcelWriter
                 (Example: '/path/to/file.xlsx')
      df : dataframe to save to workbook
      sheet_name : Name of sheet which will contain DataFrame.
                   (default: 'Sheet1')
      startrow : upper left cell row to dump data frame.
                 Per default (startrow=None) calculate the last row
                 in the existing DF and write to the next row...
      truncate_sheet : truncate (remove and recreate) [sheet_name]
                       before writing DataFrame to Excel file
      to_excel_kwargs : arguments which will be passed to `DataFrame.to_excel()`
                        [can be dictionary]

    Returns: None

    (c) [MaxU](https://stackoverflow.com/users/5741205/maxu?tab=profile)
    """
    from openpyxl import load_workbook

    # ignore [engine] parameter if it was passed
    if 'engine' in to_excel_kwargs:
        to_excel_kwargs.pop('engine')

    writer = pd.ExcelWriter(filename, engine='openpyxl')

    # Python 2.x: define [FileNotFoundError] exception if it doesn't exist 
    try:
        FileNotFoundError
    except NameError:
        FileNotFoundError = IOError


    try:
        # try to open an existing workbook
        writer.book = load_workbook(filename)
        
        # get the last row in the existing Excel sheet
        # if it was not specified explicitly
        if startrow is None and sheet_name in writer.book.sheetnames:
            startrow = writer.book[sheet_name].max_row

        # truncate sheet
        if truncate_sheet and sheet_name in writer.book.sheetnames:
            # index of [sheet_name] sheet
            idx = writer.book.sheetnames.index(sheet_name)
            # remove [sheet_name]
            writer.book.remove(writer.book.worksheets[idx])
            # create an empty sheet [sheet_name] using old index
            writer.book.create_sheet(sheet_name, idx)
        
        # copy existing sheets
        writer.sheets = {ws.title:ws for ws in writer.book.worksheets}
    except FileNotFoundError:
        # file does not exist yet, we will create it
        pass

    if startrow is None:
        startrow = 0

    # write out the new sheet
    df.to_excel(writer, sheet_name, startrow=startrow, **to_excel_kwargs)

    # save the workbook
    writer.save()
            

NOTE: for Pandas < 0.21.0, replace sheet_name with sheetname!

Usage examples:

append_df_to_excel('d:/temp/test.xlsx', df)

append_df_to_excel('d:/temp/test.xlsx', df, header=None, index=False)

append_df_to_excel('d:/temp/test.xlsx', df, sheet_name='Sheet2', index=False)

append_df_to_excel('d:/temp/test.xlsx', df, sheet_name='Sheet2', index=False, startrow=25)

Answer 2

With openpyxl version 2.4.0 and pandas version 0.19.2, the process @ski came up with gets a bit simpler:

import pandas
from openpyxl import load_workbook

with pandas.ExcelWriter('Masterfile.xlsx', engine='openpyxl') as writer:
    writer.book = load_workbook('Masterfile.xlsx')
    data_filtered.to_excel(writer, "Main", cols=['Diff1', 'Diff2'])
#That's it!

Answer 3

Starting in pandas 0.24 you can simplify this with the mode keyword argument of ExcelWriter:

import pandas as pd

with pd.ExcelWriter('the_file.xlsx', engine='openpyxl', mode='a') as writer: 
     data_filtered.to_excel(writer) 
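
For reference, later pandas versions (1.3+) also accept an if_sheet_exists argument in append mode to control what happens when a sheet of that name already exists; a sketch:

import pandas as pd

# 'replace' drops and rewrites the sheet; the default ('error') raises instead
with pd.ExcelWriter('the_file.xlsx', engine='openpyxl', mode='a',
                    if_sheet_exists='replace') as writer:
    data_filtered.to_excel(writer, sheet_name='Main')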

Answer 4

Old question, but I am guessing some people still search for this – so…

I find this method nice because all worksheets are loaded into a dictionary of sheet name and dataframe pairs, created by pandas with the sheetname=None option. It is simple to add, delete or modify worksheets between reading the spreadsheet into the dict format and writing it back from the dict. For me the xlsxwriter works better than openpyxl for this particular task in terms of speed and format.

Note: future versions of pandas (0.21.0+) will change the “sheetname” parameter to “sheet_name”.

# read a single or multi-sheet excel file
# (returns dict of sheetname(s), dataframe(s))
ws_dict = pd.read_excel(excel_file_path,
                        sheetname=None)

# all worksheets are accessible as dataframes.

# easy to change a worksheet as a dataframe:
mod_df = ws_dict['existing_worksheet']

# do work on mod_df...then reassign
ws_dict['existing_worksheet'] = mod_df

# add a dataframe to the workbook as a new worksheet with
# ws name, df as dict key, value:
ws_dict['new_worksheet'] = some_other_dataframe

# when done, write dictionary back to excel...
# xlsxwriter honors datetime and date formats
# (only included as example)...
with pd.ExcelWriter(excel_file_path,
                    engine='xlsxwriter',
                    datetime_format='yyyy-mm-dd',
                    date_format='yyyy-mm-dd') as writer:

    for ws_name, df_sheet in ws_dict.items():
        df_sheet.to_excel(writer, sheet_name=ws_name)

For the example in the 2013 question:

ws_dict = pd.read_excel('Masterfile.xlsx',
                        sheetname=None)

ws_dict['Main'] = data_filtered[['Diff1', 'Diff2']]

with pd.ExcelWriter('Masterfile.xlsx',
                    engine='xlsxwriter') as writer:

    for ws_name, df_sheet in ws_dict.items():
        df_sheet.to_excel(writer, sheet_name=ws_name)

Answer 5

I know this is an older thread, but this is the first item you find when searching, and the above solutions don’t work if you need to retain charts in a workbook that you already have created. In that case, xlwings is a better option – it allows you to write to the excel book and keeps the charts/chart data.

simple example:

import xlwings as xw
import pandas as pd

#create DF
months = ['2017-01','2017-02','2017-03','2017-04','2017-05','2017-06','2017-07','2017-08','2017-09','2017-10','2017-11','2017-12']
value1 = [x * 5+5 for x in range(len(months))]
df = pd.DataFrame(value1, index = months, columns = ['value1'])
df['value2'] = df['value1']+5
df['value3'] = df['value2']+5

# load a workbook that already has a chart in it
wb = xw.Book('C:\\data\\bookwithChart.xlsx')

ws = wb.sheets['chartData']

# write the dataframe at A1, without its index column
ws.range('A1').options(index=False).value = df

# save the updated workbook under a new name, then close Excel
wb.save('C:\\data\\bookwithChart_updated.xlsx')

xw.apps[0].quit()

Answer 6


There is a better solution in pandas 0.24:

with pd.ExcelWriter(path, mode='a') as writer:
    s.to_excel(writer, sheet_name='another sheet', index=False)

(The original answer included before and after screenshots of the workbook here.)

So upgrade your pandas now:

pip install --upgrade pandas
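
Spelled out, a minimal sketch of the append workflow (Masterfile.xlsx and the series s are hypothetical; mode='a' relies on the openpyxl engine, since xlsxwriter can only create new files):

import pandas as pd

# hypothetical data to add as a new worksheet
s = pd.Series([1, 2, 3], name='values')

# mode='a' opens the existing workbook for appending
with pd.ExcelWriter('Masterfile.xlsx', mode='a', engine='openpyxl') as writer:
    s.to_excel(writer, sheet_name='another sheet', index=False)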

Answer 7


import pandas
from openpyxl import load_workbook

def append_sheet_to_master(self, master_file_path, current_file_path, sheet_name):
    try:
        # open the existing master workbook and register its sheets
        # with the writer so they are preserved
        master_book = load_workbook(master_file_path)
        master_writer = pandas.ExcelWriter(master_file_path, engine='openpyxl')
        master_writer.book = master_book
        master_writer.sheets = dict((ws.title, ws) for ws in master_book.worksheets)

        # read the first sheet of the current file, with no header row
        current_file = pandas.ExcelFile(current_file_path)
        current_frames = current_file.parse(current_file.sheet_names[0],
                                            header=None,
                                            index_col=None)

        # write it into the master workbook and save
        current_frames.to_excel(master_writer, sheet_name, index=None, header=False)
        master_writer.save()
    except Exception as e:
        raise e

This works perfectly fine; the only thing is that the formatting of the master file (the file to which we add the new sheet) is lost.


Answer 8


writer = pd.ExcelWriter('prueba1.xlsx', engine='openpyxl', keep_date_col=True)

Hopefully the "keep_date_col" option helps you.


Answer 9

from openpyxl import load_workbook
import pandas as pd

# open the existing workbook and register its sheets with the writer
book = load_workbook(xlsFilename)
writer = pd.ExcelWriter(xlsFilename, engine='openpyxl')
writer.book = book
writer.sheets = dict((ws.title, ws) for ws in book.worksheets)

# add the dataframe as a new sheet and save
df.to_excel(writer, sheet_name=sheetName, index=False)
writer.save()

How to test if a string contains one of the substrings in a list, in pandas?

Question: How to test if a string contains one of the substrings in a list, in pandas?

Is there any function that would be the equivalent of a combination of df.isin() and df[col].str.contains()?

For example, say I have the series s = pd.Series(['cat','hat','dog','fog','pet']), and I want to find all places where s contains any of ['og', 'at'], I would want to get everything but ‘pet’.

I have a solution, but it’s rather inelegant:

searchfor = ['og', 'at']
found = [s.str.contains(x) for x in searchfor]
result = pd.DataFrame(found)
result.any()

Is there a better way to do this?


Answer 0


One option is just to use the regex | character to try to match each of the substrings in the words in your Series s (still using str.contains).

You can construct the regex by joining the words in searchfor with |:

>>> searchfor = ['og', 'at']
>>> s[s.str.contains('|'.join(searchfor))]
0    cat
1    hat
2    dog
3    fog
dtype: object

As @AndyHayden noted in the comments below, take care if your substrings have special characters such as $ and ^ which you want to match literally. These characters have specific meanings in the context of regular expressions and will affect the matching.

You can make your list of substrings safer by escaping non-alphanumeric characters with re.escape:

>>> import re
>>> matches = ['$money', 'x^y']
>>> safe_matches = [re.escape(m) for m in matches]
>>> safe_matches
['\\$money', 'x\\^y']

The strings in this new list will match each character literally when used with str.contains.
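
For instance, a quick sketch of filtering with the escaped patterns (s2 is a hypothetical series, reusing safe_matches from above):

>>> s2 = pd.Series(['$money talks', 'x^y = z', 'no match here'])
>>> s2[s2.str.contains('|'.join(safe_matches))]
0    $money talks
1         x^y = z
dtype: object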


Answer 1


You can use str.contains alone with a regex pattern using OR (|):

s[s.str.contains('og|at')]

Or you could add the series to a dataframe then use str.contains:

df = pd.DataFrame(s)
df[s.str.contains('og|at')] 

Output:

0 cat
1 hat
2 dog
3 fog 

Answer 2


Here is a one-line lambda that also works:

df["TrueFalse"] = df['col1'].apply(lambda x: 1 if any(i in x for i in searchfor) else 0)

Input:

searchfor = ['og', 'at']

df = pd.DataFrame([('cat', 1000.0), ('hat', 2000000.0), ('dog', 1000.0), ('fog', 330000.0),('pet', 330000.0)], columns=['col1', 'col2'])

  col1       col2
0  cat     1000.0
1  hat  2000000.0
2  dog     1000.0
3  fog   330000.0
4  pet   330000.0

Apply Lambda:

df["TrueFalse"] = df['col1'].apply(lambda x: 1 if any(i in x for i in searchfor) else 0)

Output:

    col1    col2        TrueFalse
0   cat     1000.0      1
1   hat     2000000.0   1
2   dog     1000.0      1
3   fog     330000.0    1
4   pet     330000.0    0
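
For larger frames the same flag can be computed without apply, reusing the regex-join approach from the earlier answers (a sketch that assumes the entries of searchfor contain no regex metacharacters):

df["TrueFalse"] = df['col1'].str.contains('|'.join(searchfor)).astype(int)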

How to suppress pandas Future Warnings?

Question: How to suppress pandas Future Warnings?

When I run the program, Pandas gives a 'FutureWarning' like the one below every time.

D:\Python\lib\site-packages\pandas\core\frame.py:3581: FutureWarning: rename with inplace=True  will return None from pandas 0.11 onward
  " from pandas 0.11 onward", FutureWarning) 

I get the message, but I just want to stop Pandas from showing such a message again and again. Is there any built-in parameter that I can set to stop Pandas popping up the 'FutureWarning'?


Answer 0


Found this on github

import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

import pandas

Answer 1


@bdiamante’s answer may only partially help you. If you still get a message after you’ve suppressed warnings, it’s because the pandas library itself is printing the message. There’s not much you can do about it unless you edit the Pandas source code yourself. Maybe there’s an option internally to suppress them, or a way to override things, but I couldn’t find one.


For those who need to know why…

Suppose that you want to ensure a clean working environment. At the top of your script, you put pd.reset_option('all'). With Pandas 0.23.4, you get the following:

>>> import pandas as pd
>>> pd.reset_option('all')
html.border has been deprecated, use display.html.border instead
(currently both are identical)

C:\projects\stackoverflow\venv\lib\site-packages\pandas\core\config.py:619: FutureWarning: html.bord
er has been deprecated, use display.html.border instead
(currently both are identical)

  warnings.warn(d.msg, FutureWarning)

: boolean
    use_inf_as_null had been deprecated and will be removed in a future
    version. Use `use_inf_as_na` instead.

C:\projects\stackoverflow\venv\lib\site-packages\pandas\core\config.py:619: FutureWarning:
: boolean
    use_inf_as_null had been deprecated and will be removed in a future
    version. Use `use_inf_as_na` instead.

  warnings.warn(d.msg, FutureWarning)

>>>

Following @bdiamante's advice, you use the warnings library. Now, true to its word, the warnings have been removed. However, several pesky messages remain:

>>> import warnings
>>> warnings.simplefilter(action='ignore', category=FutureWarning)
>>> import pandas as pd
>>> pd.reset_option('all')
html.border has been deprecated, use display.html.border instead
(currently both are identical)


: boolean
    use_inf_as_null had been deprecated and will be removed in a future
    version. Use `use_inf_as_na` instead.

>>>

In fact, disabling all warnings produces the same output:

>>> import warnings
>>> warnings.simplefilter(action='ignore', category=Warning)
>>> import pandas as pd
>>> pd.reset_option('all')
html.border has been deprecated, use display.html.border instead
(currently both are identical)


: boolean
    use_inf_as_null had been deprecated and will be removed in a future
    version. Use `use_inf_as_na` instead.

>>>

In the standard library sense, these aren’t true warnings. Pandas implements its own warnings system. Running grep -rn on the warning messages shows that the pandas warning system is implemented in core/config_init.py:

$ grep -rn "html.border has been deprecated"
core/config_init.py:207:html.border has been deprecated, use display.html.border instead

Further chasing shows that I don’t have time for this. And you probably don’t either. Hopefully this saves you from falling down the rabbit hole or perhaps inspires someone to figure out how to truly suppress these messages!


Answer 2


Warnings are annoying. As mentioned in other answers, you can suppress them using:

import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

But if you want to handle them one by one, and you are managing a bigger codebase, it will be difficult to find the line of code that is causing the warning, since warnings, unlike errors, don't come with a traceback. In order to trace warnings like errors, you can write this at the top of your code:

import warnings
warnings.filterwarnings("error")

But if the codebase is bigger, and it is importing a bunch of other libraries/packages, then all sorts of warnings will start to be raised as errors. In order to raise only a certain type of warning (in your case, FutureWarning) as an error, you can write:

import warnings
warnings.simplefilter(action='error', category=FutureWarning)
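
If you only want the escalation around a specific block of code, the standard library's context manager restores the filters on exit; a minimal, self-contained sketch (the warning here is raised by hand purely for illustration):

import warnings

with warnings.catch_warnings():
    # inside this block only, FutureWarning is raised as an error
    warnings.simplefilter('error', category=FutureWarning)
    try:
        warnings.warn('pretend pandas deprecation', FutureWarning)
    except FutureWarning as exc:
        print('caught:', exc)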

What rules does Pandas use to generate a view versus a copy?

Question: What rules does Pandas use to generate a view versus a copy?
I’m confused about the rules Pandas uses when deciding that a selection from a dataframe is a copy of the original dataframe, or a view on the original.

If I have, for example,

df = pd.DataFrame(np.random.randn(8,8), columns=list('ABCDEFGH'), index=range(1,9))

I understand that a query returns a copy so that something like

foo = df.query('2 < index <= 5')
foo.loc[:,'E'] = 40

will have no effect on the original dataframe, df. I also understand that scalar or named slices return a view, so that assignments to these, such as

df.iloc[3] = 70

or

df.ix[1,'B':'E'] = 222

will change df. But I’m lost when it comes to more complicated cases. For example,

df[df.C <= df.B] = 7654321

changes df, but

df[df.C <= df.B].ix[:,'B':'E']

does not.

Is there a simple rule that Pandas is using that I’m just missing? What’s going on in these specific cases; and in particular, how do I change all values (or a subset of values) in a dataframe that satisfy a particular query (as I’m attempting to do in the last example above)?


Note: This is not the same as this question; and I have read the documentation, but am not enlightened by it. I’ve also read through the “Related” questions on this topic, but I’m still missing the simple rule Pandas is using, and how I’d apply it to — for example — modify the values (or a subset of values) in a dataframe that satisfy a particular query.


Answer 0


Here are the rules; later ones override earlier ones:

  • All operations generate a copy

  • If inplace=True is provided, it will modify in-place; only some operations support this

  • An indexer that sets, e.g. .loc/.iloc/.iat/.at, will set in place.

  • An indexer that gets on a single-dtyped object is almost always a view (depending on the memory layout it may not be, which is why this is not reliable). This is mainly for efficiency. (The example from above is for .query; this will always return a copy, as it is evaluated by numexpr.)

  • An indexer that gets on a multiple-dtyped object is always a copy.

Your example of chained indexing

df[df.C <= df.B].loc[:,'B':'E']

is not guaranteed to work (and thus you should never do this).

Instead do:

df.loc[df.C <= df.B, 'B':'E']

as this is faster and will always work.

The chained indexing is 2 separate python operations and thus cannot be reliably intercepted by pandas (you will oftentimes get a SettingWithCopyWarning, but that is not 100% detectable either). The dev docs, which you pointed to, offer a much fuller explanation.
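
A minimal sketch contrasting the two forms, using the hypothetical df from the question (the chained version typically emits a SettingWithCopyWarning and may silently modify a throwaway copy):

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(8, 8),
                  columns=list('ABCDEFGH'), index=range(1, 9))

# chained indexing: two separate operations, so the assignment may
# land on a temporary copy and df is not reliably changed
df[df.C <= df.B].loc[:, 'B':'E'] = 0

# a single .loc call: one operation, always sets in place on df
df.loc[df.C <= df.B, 'B':'E'] = 0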