Tag Archives: dataframe

Set value for particular cell in pandas DataFrame using index

Question: Set value for particular cell in pandas DataFrame using index

I’ve created a Pandas DataFrame

df = DataFrame(index=['A','B','C'], columns=['x','y'])

and got this

    x    y
A  NaN  NaN
B  NaN  NaN
C  NaN  NaN


Then I want to assign value to particular cell, for example for row ‘C’ and column ‘x’. I’ve expected to get such result:

    x    y
A  NaN  NaN
B  NaN  NaN
C  10  NaN

with this code:

df.xs('C')['x'] = 10

but contents of df haven’t changed. It’s again only NaNs in DataFrame.

Any suggestions?


Answer 0

RukTech’s answer, df.set_value('C', 'x', 10), is far and away faster than the options I’ve suggested below. However, it has been slated for deprecation.

Going forward, the recommended method is .iat/.at.


Why df.xs('C')['x']=10 does not work:

df.xs('C'), by default, returns a new dataframe with a copy of the data, so

df.xs('C')['x']=10

modifies this new dataframe only.

df['x'] returns a view of the df dataframe, so

df['x']['C'] = 10

modifies df itself.

Warning: It is sometimes difficult to predict if an operation returns a copy or a view. For this reason the docs recommend avoiding assignments with “chained indexing”.


So the recommended alternative is

df.at['C', 'x'] = 10

which does modify df.
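
For completeness, here is a minimal, self-contained sketch of that fix applied to the question's DataFrame:

import pandas as pd

# Reproduce the question's all-NaN frame with labelled rows and columns.
df = pd.DataFrame(index=['A', 'B', 'C'], columns=['x', 'y'])

# Label-based scalar assignment: unlike df.xs('C')['x'] = 10, this writes
# through to df itself rather than to a temporary copy.
df.at['C', 'x'] = 10

print(df)
#      x    y
# A  NaN  NaN
# B  NaN  NaN
# C   10  NaN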


In [18]: %timeit df.set_value('C', 'x', 10)
100000 loops, best of 3: 2.9 µs per loop

In [20]: %timeit df['x']['C'] = 10
100000 loops, best of 3: 6.31 µs per loop

In [81]: %timeit df.at['C', 'x'] = 10
100000 loops, best of 3: 9.2 µs per loop

Answer 1

Update: The .set_value method is going to be deprecated. .iat/.at are good replacements; unfortunately, pandas provides little documentation for them.


The fastest way to do this is using set_value. This method is ~100 times faster than .ix method. For example:

df.set_value('C', 'x', 10)
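
Note that in pandas 1.0 and later set_value has been removed outright; the equivalent calls with the replacement accessors, on the question's DataFrame, are:

df.at['C', 'x'] = 10   # label-based
df.iat[2, 0] = 10      # position-based: row 2 is 'C', column 0 is 'x'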


Answer 2

You can also use a conditional lookup using .loc as seen here:

df.loc[df[<some_column_name>] == <condition>, [<another_column_name>]] = <value_to_add>

where <some_column_name> is the column you want to check the <condition> variable against, and <another_column_name> is the column you want to add to (it can be a new column or one that already exists). <value_to_add> is the value you want to add to that column/row.

This example doesn’t work precisely with the question at hand, but it might be useful for someone who wants to add a specific value based on a condition.
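
As a concrete sketch (the column names price and discounted here are made up for illustration):

import pandas as pd

df = pd.DataFrame({'price': [5, 12, 30]})

# On every row where price exceeds 10, set 'discounted' (a new column) to True.
df.loc[df['price'] > 10, 'discounted'] = True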


Answer 3

The recommended way (according to the maintainers) to set a value is:

df.ix['C','x']=10

Using ‘chained indexing’ (df['x']['C']) may lead to problems.



Answer 4

Try using df.loc[row_index,col_indexer] = value


Answer 5

This is the only thing that worked for me!

df.loc['C', 'x'] = 10

Learn more about .loc here.


Answer 6

.iat/.at is a good solution. Suppose you have this simple data frame:

    A   B   C
0   1   8   4
1   3   9   6
2  22  33  52

If we want to modify the value of the cell [0, "A"], we can use one of these solutions:

  1. df.iat[0,0] = 2
  2. df.at[0,'A'] = 2

And here is a complete example of how to use iat to get and set the value of a cell:

def preprocessing(df):
    # Double every value in the first column, cell by cell, using .iat.
    for index in range(len(df)):
        df.iat[index, 0] = df.iat[index, 0] * 2
    return df

y_train before:

    0
0   54
1   15
2   15
3   8
4   31
5   63
6   11

y_train after calling the preprocessing function, which uses iat to multiply the value of each cell by 2:

     0
0   108
1   30
2   30
3   16
4   62
5   126
6   22
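
As a side note, for a whole-column transformation like this, a vectorized expression is usually preferable to a cell-by-cell iat loop; a minimal sketch:

# Double the first column in one call, without an explicit Python loop.
y_train.iloc[:, 0] = y_train.iloc[:, 0] * 2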

Answer 7

To set values, use:

df.at[0, 'clm1'] = 0
  • The fastest recommended method for setting values.
  • set_value and ix have been deprecated.
  • No warnings, unlike iloc and loc.

Answer 8

You can use .iloc:

df.iloc[[2], [0]] = 10
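
Note that the list indexers above select a 1x1 sub-frame; for a single scalar cell, the plain integer form does the same job:

df.iloc[2, 0] = 10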

Answer 9

In my example, I just change it in the selected cells:

    for index, row in result.iterrows():
        if np.isnan(row['weight']):
            result.at[index, 'weight'] = 0.0

‘result’ is a DataFrame with a ‘weight’ column.
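
The same replacement can also be done without iterating; a vectorized sketch on the same ‘weight’ column:

# Replace NaN weights with 0.0 in one call.
result['weight'] = result['weight'].fillna(0.0)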


Answer 10

set_value() is deprecated.

Starting from release 0.23.4, Pandas “announces the future“…

>>> df
                   Cars  Prices (U$)
0               Audi TT        120.0
1 Lamborghini Aventador        245.0
2      Chevrolet Malibu        190.0
>>> df.set_value(2, 'Prices (U$)', 240.0)
__main__:1: FutureWarning: set_value is deprecated and will be removed in a future release.
Please use .at[] or .iat[] accessors instead

                   Cars  Prices (U$)
0               Audi TT        120.0
1 Lamborghini Aventador        245.0
2      Chevrolet Malibu        240.0

Considering this advice, here’s a demonstration of how to use them:

  • by row/column integer positions

>>> df.iat[1, 1] = 260.0
>>> df
                   Cars  Prices (U$)
0               Audi TT        120.0
1 Lamborghini Aventador        260.0
2      Chevrolet Malibu        240.0
  • by row/column labels

>>> df.at[2, "Cars"] = "Chevrolet Corvette"
>>> df
                  Cars  Prices (U$)
0               Audi TT        120.0
1 Lamborghini Aventador        260.0
2    Chevrolet Corvette        240.0



Answer 11

Here is a summary of the valid solutions provided by all users, for data frames indexed by integer and string.

df.iloc, df.loc and df.at work for both types of data frames; df.iloc only works with row/column integer positions, while df.loc and df.at support setting values using column names and/or integer indices.

When the specified index does not exist, both df.loc and df.at would append the newly inserted rows/columns to the existing data frame, but df.iloc would raise “IndexError: positional indexers are out-of-bounds”. A working example tested in Python 2.7 and 3.7 is as follows:

import numpy as np, pandas as pd

df1 = pd.DataFrame(index=np.arange(3), columns=['x','y','z'])
df1['x'] = ['A','B','C']
df1.at[2,'y'] = 400

# rows/columns specified does not exist, appends new rows/columns to existing data frame
df1.at['D','w'] = 9000
df1.loc['E','q'] = 499

# using df[<some_column_name>] == <condition> to retrieve target rows
df1.at[df1['x']=='B', 'y'] = 10000
df1.loc[df1['x']=='B', ['z','w']] = 10000

# using a list of index to setup values
df1.iloc[[1,2,4], 2] = 9999
df1.loc[[0,'D','E'],'w'] = 7500
df1.at[[0,2,"D"],'x'] = 10
df1.at[:, ['y', 'w']] = 8000

df1
>>> df1
     x     y     z     w      q
0   10  8000   NaN  8000    NaN
1    B  8000  9999  8000    NaN
2   10  8000  9999  8000    NaN
D   10  8000   NaN  8000    NaN
E  NaN  8000  9999  8000  499.0

Answer 12

I tested, and df.set_value is a little faster, but the official df.at method looks like the fastest non-deprecated way to do it.

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(100, 100))

%timeit df.iat[50,50]=50 # ✓
%timeit df.at[50,50]=50 #  ✔
%timeit df.set_value(50,50,50) # will deprecate
%timeit df.iloc[50,50]=50
%timeit df.loc[50,50]=50

7.06 µs ± 118 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
5.52 µs ± 64.2 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
3.68 µs ± 80.8 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
98.7 µs ± 1.07 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
109 µs ± 1.42 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Note this is setting the value for a single cell. For vectors, loc and iloc should be better options, since they are vectorized.
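
For example, a single vectorized .loc call can fill a whole column of this 100x100 frame at once, instead of issuing 100 scalar writes:

# Set every row of column 50 in one assignment.
df.loc[:, 50] = 50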


Answer 13

One way to use an index with a condition is to first get the index of all the rows that satisfy your condition, and then simply use those row indexes in multiple ways:

conditional_index = df.loc[ df['col name'] <condition> ].index

Example conditions look like:

==5, >10 , =="Any string", >= DateTime

Then you can use these row indexes in a variety of ways, like:

  1. Replace the value of one column for conditional_index:
df.loc[conditional_index, [col name]] = <new value>
  2. Replace the value of multiple columns for conditional_index:
df.loc[conditional_index, [col1, col2]] = <new value>
  3. One benefit of saving the conditional_index is that you can assign the value of one column to another column with the same row index:
df.loc[conditional_index, [col1, col2]] = df.loc[conditional_index, 'col name']

This is all possible because .index returns an array of indexes, which .loc can use with direct addressing, so it avoids repeated traversals.
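
A concrete sketch of the pattern, with made-up column names ('score', 'grade'):

import pandas as pd

df = pd.DataFrame({'score': [3, 15, 8], 'grade': [None, None, None]})

conditional_index = df.loc[df['score'] > 10].index

# Write to only the matching rows, addressed directly by their index.
df.loc[conditional_index, 'grade'] = 'high'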


Answer 14

df.loc['c','x']=10 will change the value at row 'c' and column 'x'.


Answer 15

In addition to the answers above, here is a benchmark comparing different ways to add rows of data to an already existing dataframe. It shows that using at or set_value is the most efficient way for large dataframes (at least for these test conditions).

  • Create new dataframe for each row and…
    • … append it (13.0 s)
    • … concatenate it (13.1 s)
  • Store all new rows in another container first, convert to new dataframe once and append…
    • container = lists of lists (2.0 s)
    • container = dictionary of lists (1.9 s)
  • Preallocate whole dataframe, iterate over new rows and all columns and fill using
    • … at (0.6 s)
    • … set_value (0.4 s)

For the test, an existing dataframe comprising 100,000 rows and 1,000 columns and random numpy values was used. To this dataframe, 100 new rows were added.

The code is below:

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Created on Wed Nov 21 16:38:46 2018

@author: gebbissimo
"""

import pandas as pd
import numpy as np
import time

NUM_ROWS = 100000
NUM_COLS = 1000
data = np.random.rand(NUM_ROWS,NUM_COLS)
df = pd.DataFrame(data)

NUM_ROWS_NEW = 100
data_tot = np.random.rand(NUM_ROWS + NUM_ROWS_NEW,NUM_COLS)
df_tot = pd.DataFrame(data_tot)

DATA_NEW = np.random.rand(1,NUM_COLS)


#%% FUNCTIONS

# create and append
def create_and_append(df):
    for i in range(NUM_ROWS_NEW):
        df_new = pd.DataFrame(DATA_NEW)
        df = df.append(df_new)
    return df

# create and concatenate
def create_and_concat(df):
    for i in range(NUM_ROWS_NEW):
        df_new = pd.DataFrame(DATA_NEW)
        df = pd.concat((df, df_new))
    return df


# store as list of lists, convert, and append
def store_as_list(df):
    lst = [[] for i in range(NUM_ROWS_NEW)]
    for i in range(NUM_ROWS_NEW):
        for j in range(NUM_COLS):
            lst[i].append(DATA_NEW[0,j])
    df_new = pd.DataFrame(lst)
    df_tot = df.append(df_new)
    return df_tot

# store as dict of lists, convert, and append
def store_as_dict(df):
    dct = {}
    for j in range(NUM_COLS):
        dct[j] = []
        for i in range(NUM_ROWS_NEW):
            dct[j].append(DATA_NEW[0,j])
    df_new = pd.DataFrame(dct)
    df_tot = df.append(df_new)
    return df_tot




# preallocate and fill using .at
def fill_using_at(df):
    for i in range(NUM_ROWS_NEW):
        for j in range(NUM_COLS):
            #print("i,j={},{}".format(i,j))
            df.at[NUM_ROWS+i,j] = DATA_NEW[0,j]
    return df


# preallocate and fill using .set_value
def fill_using_set(df):
    for i in range(NUM_ROWS_NEW):
        for j in range(NUM_COLS):
            #print("i,j={},{}".format(i,j))
            df.set_value(NUM_ROWS+i,j,DATA_NEW[0,j])
    return df


#%% TESTS
t0 = time.time()    
create_and_append(df)
t1 = time.time()
print('Needed {} seconds'.format(t1-t0))

t0 = time.time()    
create_and_concat(df)
t1 = time.time()
print('Needed {} seconds'.format(t1-t0))

t0 = time.time()    
store_as_list(df)
t1 = time.time()
print('Needed {} seconds'.format(t1-t0))

t0 = time.time()    
store_as_dict(df)
t1 = time.time()
print('Needed {} seconds'.format(t1-t0))

t0 = time.time()    
fill_using_at(df_tot)
t1 = time.time()
print('Needed {} seconds'.format(t1-t0))

t0 = time.time()    
fill_using_set(df_tot)
t1 = time.time()
print('Needed {} seconds'.format(t1-t0))

Answer 16

If you want to change values not for a whole row, but only for some columns:

x = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
x.iloc[1] = dict(A=10, B=-10)
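
Note that assigning a dict to a whole row sets every column (keys that are missing would typically end up as NaN). To touch only selected columns of a row, a label-based sketch:

# Update only column 'A' of row 1, leaving 'B' untouched.
x.loc[1, 'A'] = 10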

Answer 17

From version 0.21.1 you can also use the .at method. There are some differences compared to .loc, as mentioned here – pandas .at versus .loc – but it’s faster on single-value replacement.


Answer 18

So, your question is to convert the NaN at row ‘C’, column ‘x’ to the value 10.

The answer is:

df['x'].loc['C':]=10
df

An alternative, label-based scalar form is:

df.loc['C', 'x'] = 10
df

Answer 19

I too was searching for this topic and I put together a way to iterate through a DataFrame and update it with lookup values from a second DataFrame. Here is my code.

src_df = pd.read_sql_query(src_sql,src_connection)
for index1, row1 in src_df.iterrows():
    for index, row in vertical_df.iterrows():
        src_df.set_value(index=index1,col=u'etl_load_key',value=etl_load_key)
        if row1[u'src_id'] == row['SRC_ID']:
            src_df.set_value(index=index1,col=u'vertical',value=row['VERTICAL'])
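
As a side note, the nested iterrows() loop above scales poorly; a vectorized sketch of the same lookup, assuming the same column names and unique SRC_ID values, could look like this:

# Map VERTICAL from vertical_df onto src_df by matching src_id to SRC_ID.
src_df['etl_load_key'] = etl_load_key
mapping = vertical_df.set_index('SRC_ID')['VERTICAL']
src_df['vertical'] = src_df['src_id'].map(mapping)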

Difference between map, applymap and apply methods in Pandas

Question: Difference between map, applymap and apply methods in Pandas

Can you tell me when to use these vectorization methods with basic examples?

I see that map is a Series method whereas the rest are DataFrame methods. I got confused about apply and applymap methods though. Why do we have two methods for applying a function to a DataFrame? Again, simple examples which illustrate the usage would be great!


Answer 0

Straight from Wes McKinney’s Python for Data Analysis book, pg. 132 (I highly recommend this book):

Another frequent operation is applying a function on 1D arrays to each column or row. DataFrame’s apply method does exactly this:

In [116]: frame = DataFrame(np.random.randn(4, 3), columns=list('bde'), index=['Utah', 'Ohio', 'Texas', 'Oregon'])

In [117]: frame
Out[117]: 
               b         d         e
Utah   -0.029638  1.081563  1.280300
Ohio    0.647747  0.831136 -1.549481
Texas   0.513416 -0.884417  0.195343
Oregon -0.485454 -0.477388 -0.309548

In [118]: f = lambda x: x.max() - x.min()

In [119]: frame.apply(f)
Out[119]: 
b    1.133201
d    1.965980
e    2.829781
dtype: float64

Many of the most common array statistics (like sum and mean) are DataFrame methods, so using apply is not necessary.

Element-wise Python functions can be used, too. Suppose you wanted to compute a formatted string from each floating point value in frame. You can do this with applymap:

In [120]: format = lambda x: '%.2f' % x

In [121]: frame.applymap(format)
Out[121]: 
            b      d      e
Utah    -0.03   1.08   1.28
Ohio     0.65   0.83  -1.55
Texas    0.51  -0.88   0.20
Oregon  -0.49  -0.48  -0.31

The reason for the name applymap is that Series has a map method for applying an element-wise function:

In [122]: frame['e'].map(format)
Out[122]: 
Utah       1.28
Ohio      -1.55
Texas      0.20
Oregon    -0.31
Name: e, dtype: object

Summing up, apply works on a row / column basis of a DataFrame, applymap works element-wise on a DataFrame, and map works element-wise on a Series.


Answer 1

Comparing map, applymap and apply: Context Matters

First major difference: DEFINITION

  • map is defined on Series ONLY
  • applymap is defined on DataFrames ONLY
  • apply is defined on BOTH

Second major difference: INPUT ARGUMENT

  • map accepts dicts, Series, or callable
  • applymap and apply accept callables only

Third major difference: BEHAVIOR

  • map is elementwise for Series
  • applymap is elementwise for DataFrames
  • apply also works elementwise but is suited to more complex operations and aggregation. The behaviour and return value depends on the function.

Fourth major difference (the most important one): USE CASE

  • map is meant for mapping values from one domain to another, so is optimised for performance (e.g., df['A'].map({1:'a', 2:'b', 3:'c'}))
  • applymap is good for elementwise transformations across multiple rows/columns (e.g., df[['A', 'B', 'C']].applymap(str.strip))
  • apply is for applying any function that cannot be vectorised (e.g., df['sentences'].apply(nltk.sent_tokenize))

Summarising

(The original answer includes a summary table image comparing the three methods.)

Footnotes

  1. map when passed a dictionary/Series will map elements based on the keys in that dictionary/Series. Missing values will be recorded as NaN in the output.
  2. applymap in more recent versions has been optimised for some operations. You will find applymap slightly faster than apply in some cases. My suggestion is to test them both and use whatever works better.

  3. map is optimised for elementwise mappings and transformation. Operations that involve dictionaries or Series will enable pandas to use faster code paths for better performance.

  4. Series.apply returns a scalar for aggregating operations, Series otherwise. Similarly for DataFrame.apply. Note that apply also has fastpaths when called with certain NumPy functions such as mean, sum, etc.

Answer 2

There’s great information in these answers, but I’m adding my own to clearly summarize which methods work array-wise versus element-wise. jeremiahbuddha mostly did this but did not mention Series.apply. I don’t have the rep to comment.

  • DataFrame.apply operates on entire rows or columns at a time.

  • DataFrame.applymap, Series.apply, and Series.map operate on one element at time.

There is a lot of overlap between the capabilities of Series.apply and Series.map, meaning that either one will work in most cases. They do have some slight differences though, some of which were discussed in osa’s answer.
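
A small sketch of that overlap, plus one case where only map fits (a dict lookup):

import pandas as pd

s = pd.Series([1, 2, 3])

s.apply(lambda x: x * 10)   # element-wise: 10, 20, 30
s.map(lambda x: x * 10)     # identical result here

s.map({1: 'a', 2: 'b'})     # dict lookup: 'a', 'b', NaN (3 has no key)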


Answer 3

Adding to the other answers, in a Series there are also map and apply.

Apply can make a DataFrame out of a series; however, map will just put a series in every cell of another series, which is probably not what you want.

In [40]: p=pd.Series([1,2,3])
In [41]: p
Out[31]:
0    1
1    2
2    3
dtype: int64

In [42]: p.apply(lambda x: pd.Series([x, x]))
Out[42]: 
   0  1
0  1  1
1  2  2
2  3  3

In [43]: p.map(lambda x: pd.Series([x, x]))
Out[43]: 
0    0    1
1    1
dtype: int64
1    0    2
1    2
dtype: int64
2    0    3
1    3
dtype: int64
dtype: object

Also if I had a function with side effects, such as “connect to a web server”, I’d probably use apply just for the sake of clarity.

series.apply(download_file_for_every_element) 

Map can use not only a function, but also a dictionary or another series. Let’s say you want to manipulate permutations.

Take

1 2 3 4 5
2 1 4 5 3

The square of this permutation is

1 2 3 4 5
1 2 5 3 4

You can compute it using map. Not sure if self-application is documented, but it works in 0.15.1.

In [39]: p=pd.Series([1,0,3,4,2])

In [40]: p.map(p)
Out[40]: 
0    0
1    1
2    4
3    2
4    3
dtype: int64

Answer 4

@jeremiahbuddha mentioned that apply works on row/columns, while applymap works element-wise. But it seems you can still use apply for element-wise computation….

    frame.apply(np.sqrt)
    Out[102]: 
                   b         d         e
    Utah         NaN  1.435159       NaN
    Ohio    1.098164  0.510594  0.729748
    Texas        NaN  0.456436  0.697337
    Oregon  0.359079       NaN       NaN

    frame.applymap(np.sqrt)
    Out[103]: 
                   b         d         e
    Utah         NaN  1.435159       NaN
    Ohio    1.098164  0.510594  0.729748
    Texas        NaN  0.456436  0.697337
    Oregon  0.359079       NaN       NaN

Answer 5

Just wanted to point out, as I struggled with this for a bit:

def f(x):
    if x < 0:
        x = 0
    elif x > 100000:
        x = 100000
    return x

df.applymap(f)
df.describe()

this does not modify the dataframe itself; it has to be reassigned:

df = df.applymap(f)
df.describe()

Answer 6

Probably the simplest explanation of the difference between apply and applymap:

apply takes the whole column as a parameter and then assigns the result to this column.

applymap takes the separate cell value as a parameter and assigns the result back to this cell.

NB: If apply returns a single value, you will have this value instead of the column after assigning, and will eventually have just a row instead of a matrix.
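
A minimal sketch of both behaviours:

import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})

df.apply(lambda col: col.max())   # one value per column -> a single row: A 2, B 4
df.applymap(lambda v: v * 10)     # one value per cell -> a same-shaped frame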


Answer 7

My understanding:

From the function point of view:

If the function has variables that need to be compared within a column/row, use apply.

e.g.: lambda x: x.max()-x.mean().

If the function is to be applied to each element:

1> If a single column/row is targeted, use apply.

2> If it applies to the entire dataframe, use applymap.

majority = lambda x : x > 17
df2['legal_drinker'] = df2['age'].apply(majority)

def times10(x):
  if type(x) is int:
    x *= 10 
  return x
df2.applymap(times10)

Answer 8

Based on the answer of cs95

  • map is defined on Series ONLY
  • applymap is defined on DataFrames ONLY
  • apply is defined on BOTH

Here are some examples:

In [3]: frame = pd.DataFrame(np.random.randn(4, 3), columns=list('bde'), index=['Utah', 'Ohio', 'Texas', 'Oregon'])

In [4]: frame
Out[4]:
            b         d         e
Utah    0.129885 -0.475957 -0.207679
Ohio   -2.978331 -1.015918  0.784675
Texas  -0.256689 -0.226366  2.262588
Oregon  2.605526  1.139105 -0.927518

In [5]: myformat=lambda x: f'{x:.2f}'

In [6]: frame.d.map(myformat)
Out[6]:
Utah      -0.48
Ohio      -1.02
Texas     -0.23
Oregon     1.14
Name: d, dtype: object

In [7]: frame.d.apply(myformat)
Out[7]:
Utah      -0.48
Ohio      -1.02
Texas     -0.23
Oregon     1.14
Name: d, dtype: object

In [8]: frame.applymap(myformat)
Out[8]:
            b      d      e
Utah     0.13  -0.48  -0.21
Ohio    -2.98  -1.02   0.78
Texas   -0.26  -0.23   2.26
Oregon   2.61   1.14  -0.93

In [9]: frame.apply(lambda x: x.apply(myformat))
Out[9]:
            b      d      e
Utah     0.13  -0.48  -0.21
Ohio    -2.98  -1.02   0.78
Texas   -0.26  -0.23   2.26
Oregon   2.61   1.14  -0.93


In [10]: myfunc=lambda x: x**2

In [11]: frame.applymap(myfunc)
Out[11]:
            b         d         e
Utah    0.016870  0.226535  0.043131
Ohio    8.870453  1.032089  0.615714
Texas   0.065889  0.051242  5.119305
Oregon  6.788766  1.297560  0.860289

In [12]: frame.apply(myfunc)
Out[12]:
            b         d         e
Utah    0.016870  0.226535  0.043131
Ohio    8.870453  1.032089  0.615714
Texas   0.065889  0.051242  5.119305
Oregon  6.788766  1.297560  0.860289

Answer 9

FOMO:

The following example shows apply and applymap applied to a DataFrame.

The map function is something you apply on a Series only. You cannot use map on a DataFrame.

The thing to remember is that apply can do anything applymap can, but apply has eXtra options.

The X factor options are: axis and result_type where result_type only works when axis=1 (for columns).

import numpy as np
import pandas as pd

df = pd.DataFrame(1, columns=list('abc'),
                  index=list('1234'))
print(df)

f = lambda x: np.log(x)
print(df.applymap(f)) # apply to the whole dataframe
print(np.log(df)) # applied to the whole dataframe
print(df.applymap(np.sum)) # reducing can be applied for rows only

# apply can take different options (vs. applymap cannot)
print(df.apply(f)) # same as applymap
print(df.apply(sum, axis=1))  # reducing example
print(df.apply(np.log, axis=1)) # cannot reduce
print(df.apply(lambda x: [1, 2, 3], axis=1, result_type='expand')) # expand result

As a side note, the Series map function should not be confused with the Python built-in map function.

The first one is applied on Series, to map the values, and the second one to every item of an iterable.


Lastly, don’t confuse the DataFrame apply method with the groupby apply method.


Convert pandas dataframe to NumPy array

Question: Convert pandas dataframe to NumPy array

I am interested in knowing how to convert a pandas dataframe into a NumPy array.

dataframe:

import numpy as np
import pandas as pd

index = [1, 2, 3, 4, 5, 6, 7]
a = [np.nan, np.nan, np.nan, 0.1, 0.1, 0.1, 0.1]
b = [0.2, np.nan, 0.2, 0.2, 0.2, np.nan, np.nan]
c = [np.nan, 0.5, 0.5, np.nan, 0.5, 0.5, np.nan]
df = pd.DataFrame({'A': a, 'B': b, 'C': c}, index=index)
df = df.rename_axis('ID')

gives

label   A    B    C
ID                                 
1   NaN  0.2  NaN
2   NaN  NaN  0.5
3   NaN  0.2  0.5
4   0.1  0.2  NaN
5   0.1  0.2  0.5
6   0.1  NaN  0.5
7   0.1  NaN  NaN

I would like to convert this to a NumPy array, as so:

array([[ nan,  0.2,  nan],
       [ nan,  nan,  0.5],
       [ nan,  0.2,  0.5],
       [ 0.1,  0.2,  nan],
       [ 0.1,  0.2,  0.5],
       [ 0.1,  nan,  0.5],
       [ 0.1,  nan,  nan]])

How can I do this?


As a bonus, is it possible to preserve the dtypes, like this?

array([[ 1, nan,  0.2,  nan],
       [ 2, nan,  nan,  0.5],
       [ 3, nan,  0.2,  0.5],
       [ 4, 0.1,  0.2,  nan],
       [ 5, 0.1,  0.2,  0.5],
       [ 6, 0.1,  nan,  0.5],
       [ 7, 0.1,  nan,  nan]],
     dtype=[('ID', '<i4'), ('A', '<f8'), ('B', '<f8'), ('B', '<f8')])

or similar?


Answer 0

To convert a pandas dataframe (df) to a numpy ndarray, use this code:

df.values

array([[nan, 0.2, nan],
       [nan, nan, 0.5],
       [nan, 0.2, 0.5],
       [0.1, 0.2, nan],
       [0.1, 0.2, 0.5],
       [0.1, nan, 0.5],
       [0.1, nan, nan]])

Answer 1

Deprecate your usage of values and as_matrix()!

pandas v0.24.0 introduced two new methods for obtaining NumPy arrays from pandas objects:

  1. to_numpy(), which is defined on Index, Series, and DataFrame objects, and
  2. array, which is defined on Index and Series objects only.

If you visit the v0.24 docs for .values, you will see a big red warning that says:

Warning: We recommend using DataFrame.to_numpy() instead.

See this section of the v0.24.0 release notes, and this answer for more information.


Towards Better Consistency: to_numpy()

In the spirit of better consistency throughout the API, a new method to_numpy has been introduced to extract the underlying NumPy array from DataFrames.

# Setup.
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]}, index=['a', 'b', 'c'])

df.to_numpy()
array([[1, 4],
       [2, 5],
       [3, 6]])

As mentioned above, this method is also defined on Index and Series objects (see here).

df.index.to_numpy()
# array(['a', 'b', 'c'], dtype=object)

df['A'].to_numpy()
#  array([1, 2, 3])

By default, a view is returned, so any modifications made will affect the original.

v = df.to_numpy()
v[0, 0] = -1

df
   A  B
a -1  4
b  2  5
c  3  6

If you need a copy instead, use to_numpy(copy=True).

pandas >= 1.0 update for ExtensionTypes

If you’re using pandas 1.x, chances are you’ll be dealing with extension types a lot more. You’ll have to be a little more careful that these extension types are correctly converted.

a = pd.array([1, 2, None], dtype="Int64")                                  
a                                                                          

<IntegerArray>
[1, 2, <NA>]
Length: 3, dtype: Int64 

# Wrong
a.to_numpy()                                                               
# array([1, 2, <NA>], dtype=object)  # yuck, objects

# Right
a.to_numpy(dtype='float', na_value=np.nan)                                 
# array([ 1.,  2., nan])

This is called out in the docs.

If you need the dtypes

As shown in another answer, DataFrame.to_records is a good way to do this.

df.to_records()
# rec.array([('a', -1, 4), ('b',  2, 5), ('c',  3, 6)],
#           dtype=[('index', 'O'), ('A', '<i8'), ('B', '<i8')])

This cannot be done with to_numpy, unfortunately. However, as an alternative, you can use np.rec.fromrecords:

v = df.reset_index()
np.rec.fromrecords(v, names=v.columns.tolist())
# rec.array([('a', -1, 4), ('b',  2, 5), ('c',  3, 6)],
#          dtype=[('index', '<U1'), ('A', '<i8'), ('B', '<i8')])

Performance wise, it’s nearly the same (actually, using rec.fromrecords is a bit faster).

df2 = pd.concat([df] * 10000)

%timeit df2.to_records()
%%timeit
v = df2.reset_index()
np.rec.fromrecords(v, names=v.columns.tolist())

11.1 ms ± 557 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
9.67 ms ± 126 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Rationale for Adding a New Method

to_numpy() (in addition to array) was added as a result of discussions under two GitHub issues GH19954 and GH23623.

Specifically, the docs mention the rationale:

[…] with .values it was unclear whether the returned value would be the actual array, some transformation of it, or one of pandas custom arrays (like Categorical). For example, with PeriodIndex, .values generates a new ndarray of period objects each time. […]

to_numpy aims to improve the consistency of the API, which is a major step in the right direction. .values will not be deprecated in the current version, but I expect this may happen at some point in the future, so I would urge users to migrate towards the newer API as soon as they can.


Critique of Other Solutions

DataFrame.values has inconsistent behaviour, as already noted.

DataFrame.get_values() is simply a wrapper around DataFrame.values, so everything said above applies.

DataFrame.as_matrix() is deprecated now, do NOT use!


Answer 2

注意.as_matrix()不建议使用此答案中的方法。熊猫0.23.4警告:

方法.as_matrix将在以后的版本中删除。请改用.values。


熊猫内置一些东西…

numpy_matrix = df.as_matrix()

array([[nan, 0.2, nan],
       [nan, nan, 0.5],
       [nan, 0.2, 0.5],
       [0.1, 0.2, nan],
       [0.1, 0.2, 0.5],
       [0.1, nan, 0.5],
       [0.1, nan, nan]])

Note: The .as_matrix() method used in this answer is deprecated. Pandas 0.23.4 warns:

Method .as_matrix will be removed in a future version. Use .values instead.


Pandas has something built in…

numpy_matrix = df.as_matrix()

gives

array([[nan, 0.2, nan],
       [nan, nan, 0.5],
       [nan, 0.2, 0.5],
       [0.1, 0.2, nan],
       [0.1, 0.2, 0.5],
       [0.1, nan, 0.5],
       [0.1, nan, nan]])
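
On pandas 0.24 and later, the non-deprecated spelling of the same conversion is:

numpy_matrix = df.to_numpy()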

Answer 3

I would just chain the DataFrame.reset_index() and DataFrame.values functions to get the Numpy representation of the dataframe, including the index:

In [8]: df
Out[8]: 
          A         B         C
0 -0.982726  0.150726  0.691625
1  0.617297 -0.471879  0.505547
2  0.417123 -1.356803 -1.013499
3 -0.166363 -0.957758  1.178659
4 -0.164103  0.074516 -0.674325
5 -0.340169 -0.293698  1.231791
6 -1.062825  0.556273  1.508058
7  0.959610  0.247539  0.091333

[8 rows x 3 columns]

In [9]: df.reset_index().values
Out[9]:
array([[ 0.        , -0.98272574,  0.150726  ,  0.69162512],
       [ 1.        ,  0.61729734, -0.47187926,  0.50554728],
       [ 2.        ,  0.4171228 , -1.35680324, -1.01349922],
       [ 3.        , -0.16636303, -0.95775849,  1.17865945],
       [ 4.        , -0.16410334,  0.0745164 , -0.67432474],
       [ 5.        , -0.34016865, -0.29369841,  1.23179064],
       [ 6.        , -1.06282542,  0.55627285,  1.50805754],
       [ 7.        ,  0.95961001,  0.24753911,  0.09133339]])

To get the dtypes we’d need to transform this ndarray into a structured array using view:

In [10]: df.reset_index().values.ravel().view(dtype=[('index', int), ('A', float), ('B', float), ('C', float)])
Out[10]:
array([( 0, -0.98272574,  0.150726  ,  0.69162512),
       ( 1,  0.61729734, -0.47187926,  0.50554728),
       ( 2,  0.4171228 , -1.35680324, -1.01349922),
       ( 3, -0.16636303, -0.95775849,  1.17865945),
       ( 4, -0.16410334,  0.0745164 , -0.67432474),
       ( 5, -0.34016865, -0.29369841,  1.23179064),
       ( 6, -1.06282542,  0.55627285,  1.50805754),
       ( 7,  0.95961001,  0.24753911,  0.09133339)],
      dtype=[('index', '<i8'), ('A', '<f8'), ('B', '<f8'), ('C', '<f8')])

Answer 4

You can use the to_records method, but have to play around a bit with the dtypes if they are not what you want from the get go. In my case, having copied your DF from a string, the index type is string (represented by an object dtype in pandas):

In [102]: df
Out[102]: 
label    A    B    C
ID                  
1      NaN  0.2  NaN
2      NaN  NaN  0.5
3      NaN  0.2  0.5
4      0.1  0.2  NaN
5      0.1  0.2  0.5
6      0.1  NaN  0.5
7      0.1  NaN  NaN

In [103]: df.index.dtype
Out[103]: dtype('object')
In [104]: df.to_records()
Out[104]: 
rec.array([(1, nan, 0.2, nan), (2, nan, nan, 0.5), (3, nan, 0.2, 0.5),
       (4, 0.1, 0.2, nan), (5, 0.1, 0.2, 0.5), (6, 0.1, nan, 0.5),
       (7, 0.1, nan, nan)], 
      dtype=[('index', '|O8'), ('A', '<f8'), ('B', '<f8'), ('C', '<f8')])
In [106]: df.to_records().dtype
Out[106]: dtype([('index', '|O8'), ('A', '<f8'), ('B', '<f8'), ('C', '<f8')])

Converting the recarray dtype does not work for me, but one can do this in Pandas already:

In [109]: df.index = df.index.astype('i8')
In [111]: df.to_records().view([('ID', '<i8'), ('A', '<f8'), ('B', '<f8'), ('C', '<f8')])
Out[111]:
rec.array([(1, nan, 0.2, nan), (2, nan, nan, 0.5), (3, nan, 0.2, 0.5),
       (4, 0.1, 0.2, nan), (5, 0.1, 0.2, 0.5), (6, 0.1, nan, 0.5),
       (7, 0.1, nan, nan)], 
      dtype=[('ID', '<i8'), ('A', '<f8'), ('B', '<f8'), ('C', '<f8')])

Note that Pandas does not set the name of the index properly (to ID) in the exported record array (a bug?), so we profit from the type conversion to also correct for that.

At the moment Pandas has only 8-byte integers, i8, and floats, f8 (see this issue).


Answer 5

It seems like df.to_records() will work for you. The exact feature you’re looking for was requested and to_records pointed to as an alternative.

I tried this out locally using your example, and that call yields something very similar to the output you were looking for:

rec.array([(1, nan, 0.2, nan), (2, nan, nan, 0.5), (3, nan, 0.2, 0.5),
       (4, 0.1, 0.2, nan), (5, 0.1, 0.2, 0.5), (6, 0.1, nan, 0.5),
       (7, 0.1, nan, nan)],
      dtype=[(u'ID', '<i8'), (u'A', '<f8'), (u'B', '<f8'), (u'C', '<f8')])

Note that this is a recarray rather than an array. You could move the result into a regular numpy array by calling its constructor as np.array(df.to_records()).


Answer 6

Try this:

import numpy

a = numpy.asarray(df)

Answer 7

Here is my approach to making a structure array from a pandas DataFrame.

Create the data frame

import pandas as pd
import numpy as np
import six

NaN = float('nan')
ID = [1, 2, 3, 4, 5, 6, 7]
A = [NaN, NaN, NaN, 0.1, 0.1, 0.1, 0.1]
B = [0.2, NaN, 0.2, 0.2, 0.2, NaN, NaN]
C = [NaN, 0.5, 0.5, NaN, 0.5, 0.5, NaN]
columns = {'A':A, 'B':B, 'C':C}
df = pd.DataFrame(columns, index=ID)
df.index.name = 'ID'
print(df)

      A    B    C
ID               
1   NaN  0.2  NaN
2   NaN  NaN  0.5
3   NaN  0.2  0.5
4   0.1  0.2  NaN
5   0.1  0.2  0.5
6   0.1  NaN  0.5
7   0.1  NaN  NaN

Define function to make a numpy structure array (not a record array) from a pandas DataFrame.

def df_to_sarray(df):
    """
    Convert a pandas DataFrame object to a numpy structured array.
    This is functionally equivalent to but more efficient than
    np.array(df.to_records())

    :param df: the data frame to convert
    :return: a numpy structured array representation of df
    """

    v = df.values
    cols = df.columns

    if six.PY2:  # python 2 needs .encode() but 3 does not
        types = [(cols[i].encode(), df[k].dtype.type) for (i, k) in enumerate(cols)]
    else:
        types = [(cols[i], df[k].dtype.type) for (i, k) in enumerate(cols)]
    dtype = np.dtype(types)
    z = np.zeros(v.shape[0], dtype)
    for (i, k) in enumerate(z.dtype.names):
        z[k] = v[:, i]
    return z

Use reset_index to make a new data frame that includes the index as part of its data. Convert that data frame to a structure array.

sa = df_to_sarray(df.reset_index())
sa

array([(1L, nan, 0.2, nan), (2L, nan, nan, 0.5), (3L, nan, 0.2, 0.5),
       (4L, 0.1, 0.2, nan), (5L, 0.1, 0.2, 0.5), (6L, 0.1, nan, 0.5),
       (7L, 0.1, nan, nan)], 
      dtype=[('ID', '<i8'), ('A', '<f8'), ('B', '<f8'), ('C', '<f8')])

EDIT: Updated df_to_sarray to avoid error calling .encode() with python 3. Thanks to Joseph Garvin and halcyon for their comment and solution.


Answer 8

Two ways to convert the data-frame to its Numpy-array representation.

  • mah_np_array = df.as_matrix(columns=None)

  • mah_np_array = df.values

Doc: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.as_matrix.html
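
Note that as_matrix was deprecated in pandas 0.23 and removed in pandas 1.0, so on a modern installation only the second option still works. A hedged sketch of both, on an illustrative frame:

import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})  # illustrative frame
mah_np_array = df.values        # works on all pandas versions
# mah_np_array = df.to_numpy()  # preferred replacement on pandas >= 0.24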


Answer 9

A Simpler Way for Example DataFrame:

df

         gbm       nnet        reg
0  12.097439  12.047437  12.100953
1  12.109811  12.070209  12.095288
2  11.720734  11.622139  11.740523
3  11.824557  11.926414  11.926527
4  11.800868  11.727730  11.729737
5  12.490984  12.502440  12.530894

USE:

np.array(df.to_records().view(type=np.matrix))

GET:

array([[(0, 12.097439  , 12.047437, 12.10095324),
        (1, 12.10981081, 12.070209, 12.09528824),
        (2, 11.72073428, 11.622139, 11.74052253),
        (3, 11.82455653, 11.926414, 11.92652727),
        (4, 11.80086775, 11.72773 , 11.72973699),
        (5, 12.49098389, 12.50244 , 12.53089367)]],
dtype=(numpy.record, [('index', '<i8'), ('gbm', '<f8'), ('nnet', '<f4'),
       ('reg', '<f8')]))

Answer 10

Just had a similar problem when exporting from dataframe to arcgis table and stumbled on a solution from usgs (https://my.usgs.gov/confluence/display/cdi/pandas.DataFrame+to+ArcGIS+Table). In short your problem has a similar solution:

df

      A    B    C
ID               
1   NaN  0.2  NaN
2   NaN  NaN  0.5
3   NaN  0.2  0.5
4   0.1  0.2  NaN
5   0.1  0.2  0.5
6   0.1  NaN  0.5
7   0.1  NaN  NaN

np_data = np.array(np.rec.fromrecords(df.values))
np_names = df.dtypes.index.tolist()
np_data.dtype.names = tuple([name.encode('UTF8') for name in np_names])

np_data

array([( nan,  0.2,  nan), ( nan,  nan,  0.5), ( nan,  0.2,  0.5),
       ( 0.1,  0.2,  nan), ( 0.1,  0.2,  0.5), ( 0.1,  nan,  0.5),
       ( 0.1,  nan,  nan)], 
      dtype=(numpy.record, [('A', '<f8'), ('B', '<f8'), ('C', '<f8')]))
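
One hedged caveat: the .encode('UTF8') step yields bytes field names, which suits Python 2. On Python 3, numpy expects str field names, so the encode step can simply be dropped (an assumption based on numpy's Python 3 behaviour, not part of the original recipe):

import numpy as np  # assuming the same df as printed above

np_data = np.array(np.rec.fromrecords(df.values))
np_data.dtype.names = tuple(df.dtypes.index.tolist())  # plain str names on Python 3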

Answer 11

I went through the answers above. The "as_matrix()" method works, but it is obsolete now. For me, what worked was ".to_numpy()".

This returns a multidimensional array. I prefer this method if you are reading data from an Excel sheet and need to access the data by index. Hope this helps :)


Answer 12

Further to meteore’s answer, I found the code

df.index = df.index.astype('i8')

doesn’t work for me. So I put my code here for the convenience of others stuck with this issue.

import numpy as np
import pandas as pd

city_cluster_df = pd.read_csv(text_filepath, encoding='utf-8')
# the field 'city_en' is a string; when converted to a Numpy array it becomes an object
city_cluster_arr = city_cluster_df[['city_en', 'lat', 'lon', 'cluster', 'cluster_filtered']].to_records()
descr = city_cluster_arr.dtype.descr
# change the field 'city_en' to string type (its index here is 1 because
# the field is preceded by the row index of the dataframe)
descr[1] = (descr[1][0], "S20")
newArr = city_cluster_arr.astype(np.dtype(descr))

Answer 13

A simple way to convert dataframe to numpy array:

import pandas as pd

df = pd.DataFrame({"A": [1, 2], "B": [3, 4]})
df_to_array = df.to_numpy()
# array([[1, 3],
#        [2, 4]])

Use of to_numpy() is encouraged for consistency.

Reference: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_numpy.html


Answer 14

Try this:

np.array(df) 

array([['ID', nan, nan, nan],
   ['1', nan, 0.2, nan],
   ['2', nan, nan, 0.5],
   ['3', nan, 0.2, 0.5],
   ['4', 0.1, 0.2, nan],
   ['5', 0.1, 0.2, 0.5],
   ['6', 0.1, nan, 0.5],
   ['7', 0.1, nan, nan]], dtype=object)

Some more information at https://docs.scipy.org/doc/numpy/reference/generated/numpy.array.html. Valid for numpy 1.16.5 and pandas 0.25.2.


Creating an empty Pandas DataFrame, then filling it?

Question: Creating an empty Pandas DataFrame, then filling it?

I’m starting from the pandas DataFrame docs here: http://pandas.pydata.org/pandas-docs/stable/dsintro.html

I’d like to iteratively fill the DataFrame with values in a time series kind of calculation. So basically, I’d like to initialize the DataFrame with columns A, B and timestamp rows, all 0 or all NaN.

I’d then add initial values and go over this data calculating the new row from the row before, say row[A][t] = row[A][t-1]+1 or so.

I’m currently using the code as below, but I feel it’s kind of ugly and there must be a way to do this with a DataFrame directly, or just a better way in general. Note: I’m using Python 2.7.

import datetime as dt
import pandas as pd
import scipy as s

if __name__ == '__main__':
    base = dt.datetime.today().date()
    dates = [ base - dt.timedelta(days=x) for x in range(0,10) ]
    dates.sort()

    valdict = {}
    symbols = ['A','B', 'C']
    for symb in symbols:
        valdict[symb] = pd.Series( s.zeros( len(dates)), dates )

    for thedate in dates:
        if thedate > dates[0]:
            for symb in valdict:
                valdict[symb][thedate] = 1+valdict[symb][thedate - dt.timedelta(days=1)]

    print valdict

Answer 0

Here are a couple of suggestions:

Use date_range for the index:

import datetime
import pandas as pd
import numpy as np

todays_date = datetime.datetime.now().date()
index = pd.date_range(todays_date-datetime.timedelta(10), periods=10, freq='D')

columns = ['A','B', 'C']

Note: we could create an empty DataFrame (with NaNs) simply by writing:

df_ = pd.DataFrame(index=index, columns=columns)
df_ = df_.fillna(0) # with 0s rather than NaNs

To do these type of calculations for the data, use a numpy array:

data = np.array([np.arange(10)]*3).T

Hence we can create the DataFrame:

In [10]: df = pd.DataFrame(data, index=index, columns=columns)

In [11]: df
Out[11]: 
            A  B  C
2012-11-29  0  0  0
2012-11-30  1  1  1
2012-12-01  2  2  2
2012-12-02  3  3  3
2012-12-03  4  4  4
2012-12-04  5  5  5
2012-12-05  6  6  6
2012-12-06  7  7  7
2012-12-07  8  8  8
2012-12-08  9  9  9

Answer 1

If you simply want to create an empty data frame and fill it with some incoming data frames later, try this:

newDF = pd.DataFrame() #creates a new dataframe that's empty
newDF = newDF.append(oldDF, ignore_index = True) # ignoring index is optional
# try printing some data from newDF
print newDF.head() #again optional 

In this example I am using this pandas doc to create a new data frame and then using append to write to the newDF with data from oldDF.

If I have to keep appending new data into this newDF from more than one oldDF, I just use a for loop to iterate over pandas.DataFrame.append(), as sketched below.
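
A minimal sketch of that loop, assuming a hypothetical list old_dfs of source frames (note that DataFrame.append was removed in pandas 2.0, where pd.concat takes its place):

import pandas as pd

old_dfs = [pd.DataFrame({'a': [1]}), pd.DataFrame({'a': [2]})]  # hypothetical sources
newDF = pd.DataFrame()
for oldDF in old_dfs:
    newDF = newDF.append(oldDF, ignore_index=True)  # pandas < 2.0 only
# on pandas >= 2.0: newDF = pd.concat(old_dfs, ignore_index=True)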


Answer 2

The Right Way™ to Create a DataFrame

TLDR; (just read the bold text)

Most answers here will tell you how to create an empty DataFrame and fill it out, but no one will tell you that it is a bad thing to do.

Here is my advice: Wait until you are sure you have all the data you need to work with. Use a list to collect your data, then initialise a DataFrame when you are ready.

data = []
for a, b, c in some_function_that_yields_data():
    data.append([a, b, c])

df = pd.DataFrame(data, columns=['A', 'B', 'C'])

It is always cheaper to append to a list and create a DataFrame in one go than it is to create an empty DataFrame (or one of NaNs) and append to it over and over again. Lists also take up less memory and are a much lighter data structure to work with, append to, and remove from (if needed).

The other advantage of this method is dtypes are automatically inferred (rather than assigning object to all of them).

The last advantage is that a RangeIndex is automatically created for your data, so it is one less thing to worry about (take a look at the poor append and loc methods below, you will see elements in both that require handling the index appropriately).


Things you should NOT do

append or concat inside a loop

Here is the biggest mistake I’ve seen from beginners:

df = pd.DataFrame(columns=['A', 'B', 'C'])
for a, b, c in some_function_that_yields_data():
    df = df.append({'A': a, 'B': b, 'C': c}, ignore_index=True) # yuck
    # or similarly,
    # df = pd.concat([df, pd.Series({'A': a, 'B': b, 'C': c})], ignore_index=True)

Memory is re-allocated for every append or concat operation you have. Couple this with a loop and you have a quadratic complexity operation. From the df.append doc page:

Iteratively appending rows to a DataFrame can be more computationally intensive than a single concatenate. A better solution is to append those rows to a list and then concatenate the list with the original DataFrame all at once.

The other mistake associated with df.append is that users tend to forget append is not an in-place function, so the result must be assigned back. You also have to worry about the dtypes:

df = pd.DataFrame(columns=['A', 'B', 'C'])
df = df.append({'A': 1, 'B': 12.3, 'C': 'xyz'}, ignore_index=True)

df.dtypes
A     object   # yuck!
B    float64
C     object
dtype: object

Dealing with object columns is never a good thing, because pandas cannot vectorize operations on those columns. You will need to do this to fix it:

df.infer_objects().dtypes
A      int64
B    float64
C     object
dtype: object

loc inside a loop

I have also seen loc used to append to a DataFrame that was created empty:

df = pd.DataFrame(columns=['A', 'B', 'C'])
for a, b, c in some_function_that_yields_data():
    df.loc[len(df)] = [a, b, c]

As before, you have not pre-allocated the amount of memory you need each time, so the memory is re-grown each time you create a new row. It's just as bad as append, and even uglier.

Empty DataFrame of NaNs

And then, there’s creating a DataFrame of NaNs, and all the caveats associated therewith.

df = pd.DataFrame(columns=['A', 'B', 'C'], index=range(5))
df
     A    B    C
0  NaN  NaN  NaN
1  NaN  NaN  NaN
2  NaN  NaN  NaN
3  NaN  NaN  NaN
4  NaN  NaN  NaN

It creates a DataFrame of object columns, like the others.

df.dtypes
A    object  # you DON'T want this
B    object
C    object
dtype: object

Appending still has all the same issues as the methods above.

for i, (a, b, c) in enumerate(some_function_that_yields_data()):
    df.iloc[i] = [a, b, c]

The Proof is in the Pudding

Timing these methods is the fastest way to see just how much they differ in terms of their memory and utility.

(benchmark plot comparing the timings of the methods above omitted)

Benchmarking code for reference.


Answer 3

Initialize empty frame with column names

import pandas as pd

col_names =  ['A', 'B', 'C']
my_df  = pd.DataFrame(columns = col_names)
my_df

Add a new record to a frame

my_df.loc[len(my_df)] = [2, 4, 5]

You also might want to pass a dictionary:

my_dic = {'A':2, 'B':4, 'C':5}
my_df.loc[len(my_df)] = my_dic 

Append another frame to your existing frame

col_names =  ['A', 'B', 'C']
my_df2  = pd.DataFrame(columns = col_names)
my_df = my_df.append(my_df2)

Performance considerations

If you are adding rows inside a loop, consider performance issues. For around the first 1000 records “my_df.loc” performance is better, but it gradually becomes slower as the number of records in the loop increases.

If you plan to do this inside a big loop (say 10M records or so), you are better off using a mixture of these two: fill a temporary dataframe with .loc until its size gets to around 1000, then append it to the original dataframe and empty the temporary dataframe. This would boost your performance by around 10 times. A sketch follows below.
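
A hedged sketch of that chunked strategy; the row source and the chunk size of 1000 are illustrative assumptions, not the answer's original code:

import pandas as pd

cols = ['A', 'B', 'C']
main_df = pd.DataFrame(columns=cols)
temp_df = pd.DataFrame(columns=cols)

for i in range(10000):  # stand-in for the real row source
    temp_df.loc[len(temp_df)] = [i, i * 2, i * 3]
    if len(temp_df) >= 1000:  # flush the buffer into the main frame
        main_df = pd.concat([main_df, temp_df], ignore_index=True)
        temp_df = pd.DataFrame(columns=cols)

if len(temp_df):  # flush any remainder
    main_df = pd.concat([main_df, temp_df], ignore_index=True)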


Answer 4

Assume a dataframe with 19 rows

index=range(0,19)
index

columns=['A']
test = pd.DataFrame(index=index, columns=columns)

Keeping Column A as a constant

test['A']=10

Keeping column b as a variable given by a loop

for x in range(0,19):
    test.loc[[x], 'b'] = pd.Series([x], index = [x])

You can replace the first x in pd.Series([x], index = [x]) with any value
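
For what it's worth, a simpler equivalent of that loop (my rewording, not the answer's original code) is a plain scalar .loc assignment:

for x in range(0, 19):
    test.loc[x, 'b'] = x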


How to count the NaN values in a column in a pandas DataFrame

Question: How to count the NaN values in a column in a pandas DataFrame

I have data in which I want to find the number of NaN values, so that if it is less than some threshold I can drop the column. I looked, but wasn't able to find any function for this. There is value_counts, but it would be slow for me, because most of the values are distinct and I only want the count of NaN.


Answer 0

You can use the isna() method (or its alias isnull(), which is also compatible with older pandas versions < 0.21.0) and then sum to count the NaN values. For one column:

In [1]: s = pd.Series([1,2,3, np.nan, np.nan])

In [4]: s.isna().sum()   # or s.isnull().sum() for older pandas versions
Out[4]: 2

For several columns, it also works:

In [5]: df = pd.DataFrame({'a':[1,2,np.nan], 'b':[np.nan,1,np.nan]})

In [6]: df.isna().sum()
Out[6]:
a    1
b    2
dtype: int64

Answer 1

You could subtract the total length from the count of non-nan values:

count_nan = len(df) - df.count()

You should time it on your data. For a small Series I got a 3x speed-up in comparison with the isnull solution.
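
A quick sketch on the same small frame as in the previous answer, assuming identical data:

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 2, np.nan], 'b': [np.nan, 1, np.nan]})
count_nan = len(df) - df.count()
# a    1
# b    2
# dtype: int64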


Answer 2

Let's assume df is a pandas DataFrame.

Then,

df.isnull().sum(axis = 0)

This will give number of NaN values in every column.

If you need the NaN values in every row:

df.isnull().sum(axis = 1)

Answer 3

Based on the most voted answer we can easily define a function that gives us a dataframe to preview the missing values and the % of missing values in each column:

def missing_values_table(df):
    mis_val = df.isnull().sum()
    mis_val_percent = 100 * df.isnull().sum() / len(df)
    mis_val_table = pd.concat([mis_val, mis_val_percent], axis=1)
    mis_val_table_ren_columns = mis_val_table.rename(
        columns={0: 'Missing Values', 1: '% of Total Values'})
    mis_val_table_ren_columns = mis_val_table_ren_columns[
        mis_val_table_ren_columns.iloc[:, 1] != 0].sort_values(
            '% of Total Values', ascending=False).round(1)
    print("Your selected dataframe has " + str(df.shape[1]) + " columns.\n"
          "There are " + str(mis_val_table_ren_columns.shape[0]) +
          " columns that have missing values.")
    return mis_val_table_ren_columns
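
Hypothetical usage of the helper above, on a small illustrative frame:

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 2, np.nan], 'b': [np.nan, 1, np.nan]})
missing_values_table(df)
# Your selected dataframe has 2 columns.
# There are 2 columns that have missing values.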

Answer 4

Since pandas 0.14.1 my suggestion here to have a keyword argument in the value_counts method has been implemented:

import pandas as pd
df = pd.DataFrame({'a':[1,2,np.nan], 'b':[np.nan,1,np.nan]})
for col in df:
    print df[col].value_counts(dropna=False)

2     1
 1     1
NaN    1
dtype: int64
NaN    2
 1     1
dtype: int64

Answer 5

If it's just counting NaN values in a pandas column, here is a quick way:

import pandas as pd
## df1 as an example data frame 
## col1 name of column for which you want to calculate the nan values
sum(pd.isnull(df1['col1']))

Answer 6

If you are using a Jupyter Notebook, how about:

 %%timeit
 df.isnull().any().any()

or

 %timeit 
 df.isnull().values.sum()

or, to check whether there are NaNs anywhere in the data and, if yes, where:

 df.isnull().any()

Answer 7

The below will print all the NaN columns in descending order.

df.isnull().sum().sort_values(ascending = False)

or

The below will print the first 15 NaN columns in descending order.

df.isnull().sum().sort_values(ascending = False).head(15)

Answer 8

import numpy as np
import pandas as pd

raw_data = {'first_name': ['Jason', np.nan, 'Tina', 'Jake', 'Amy'], 
        'last_name': ['Miller', np.nan, np.nan, 'Milner', 'Cooze'], 
        'age': [22, np.nan, 23, 24, 25], 
        'sex': ['m', np.nan, 'f', 'm', 'f'], 
        'Test1_Score': [4, np.nan, 0, 0, 0],
        'Test2_Score': [25, np.nan, np.nan, 0, 0]}
results = pd.DataFrame(raw_data, columns = ['first_name', 'last_name', 'age', 'sex', 'Test1_Score', 'Test2_Score'])

results 
'''
  first_name last_name   age  sex  Test1_Score  Test2_Score
0      Jason    Miller  22.0    m          4.0         25.0
1        NaN       NaN   NaN  NaN          NaN          NaN
2       Tina       NaN  23.0    f          0.0          NaN
3       Jake    Milner  24.0    m          0.0          0.0
4        Amy     Cooze  25.0    f          0.0          0.0
'''

You can use the following function, which will give you output in a DataFrame:

  • Zero Values
  • Missing Values
  • % of Total Values
  • Total Zero Missing Values
  • % Total Zero Missing Values
  • Data Type

Just copy and paste the following function and call it by passing your pandas DataFrame:

def missing_zero_values_table(df):
    zero_val = (df == 0.00).astype(int).sum(axis=0)
    mis_val = df.isnull().sum()
    mis_val_percent = 100 * df.isnull().sum() / len(df)
    mz_table = pd.concat([zero_val, mis_val, mis_val_percent], axis=1)
    mz_table = mz_table.rename(
        columns={0: 'Zero Values', 1: 'Missing Values', 2: '% of Total Values'})
    mz_table['Total Zero Missing Values'] = mz_table['Zero Values'] + mz_table['Missing Values']
    mz_table['% Total Zero Missing Values'] = 100 * mz_table['Total Zero Missing Values'] / len(df)
    mz_table['Data Type'] = df.dtypes
    mz_table = mz_table[
        mz_table.iloc[:, 1] != 0].sort_values(
            '% of Total Values', ascending=False).round(1)
    print("Your selected dataframe has " + str(df.shape[1]) + " columns and " + str(df.shape[0]) + " Rows.\n"
          "There are " + str(mz_table.shape[0]) +
          " columns that have missing values.")
    # mz_table.to_excel('D:/sampledata/missing_and_zero_values.xlsx', freeze_panes=(1,0), index=False)
    return mz_table

missing_zero_values_table(results)

Output

Your selected dataframe has 6 columns and 5 Rows.
There are 6 columns that have missing values.

             Zero Values  Missing Values  % of Total Values  Total Zero Missing Values  % Total Zero Missing Values Data Type
last_name              0               2               40.0                          2                         40.0    object
Test2_Score            2               2               40.0                          4                         80.0   float64
first_name             0               1               20.0                          1                         20.0    object
age                    0               1               20.0                          1                         20.0   float64
sex                    0               1               20.0                          1                         20.0    object
Test1_Score            3               1               20.0                          4                         80.0   float64

If you want to keep it simple, then you can use the following function to get missing values as a percentage:

def missing(dff):
    print (round((dff.isnull().sum() * 100/ len(dff)),2).sort_values(ascending=False))


missing(results)
'''
Test2_Score    40.0
last_name      40.0
Test1_Score    20.0
sex            20.0
age            20.0
first_name     20.0
dtype: float64
'''

Answer 9

To count zeroes:

df[df == 0].count(axis=0)

To count NaN:

df.isnull().sum()

or

df.isna().sum()

Answer 10

You can use the value_counts method and look up the count for np.nan:

s.value_counts(dropna = False)[np.nan]

Answer 11

Please use the below for a particular column's count:

dataframe.columnName.isnull().sum()

Answer 12

df1.isnull().sum()

This will do the trick.


Answer 13

Here is the code for counting null values column-wise:

df.isna().sum()

Answer 14

There is a nice Dzone article from July 2017 which details various ways of summarising NaN values. Check it out here.

The article I have cited provides additional value by: (1) Showing a way to count and display NaN counts for every column so that one can easily decide whether or not to discard those columns and (2) Demonstrating a way to select those rows in specific which have NaNs so that they may be selectively discarded or imputed.

Here’s a quick example to demonstrate the utility of the approach – with only a few columns perhaps its usefulness is not obvious but I found it to be of help for larger data-frames.

import pandas as pd
import numpy as np

# example DataFrame
df = pd.DataFrame({'a':[1,2,np.nan], 'b':[np.nan,1,np.nan]})

# Check whether there are null values in columns
null_columns = df.columns[df.isnull().any()]
print(df[null_columns].isnull().sum())

# One can follow along further per the cited article

Answer 15

One other simple option not suggested yet, to just count NaNs, would be using the shape to return the number of rows with NaN.

df[df['col_name'].isnull()]['col_name'].shape
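
A minimal sketch with an illustrative frame, adding [0] to get the count as a plain number:

import numpy as np
import pandas as pd

df = pd.DataFrame({'col_name': [1, np.nan, np.nan]})
n_nan = df[df['col_name'].isnull()]['col_name'].shape[0]  # 2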

Answer 16

df.isnull().sum() will give the column-wise sum of missing values.

If you want to know the sum of missing values in a particular column, then the following code will work: df.column.isnull().sum()


Answer 17

Based on the answer that was given and some improvements, this is my approach:

def PercentageMissin(Dataset):
    """Return the percentage of missing values per column in a dataset."""
    if isinstance(Dataset, pd.DataFrame):
        adict = {}  # keys: column names; values: percentage of missing values in the column
        for col in Dataset.columns:
            adict[col] = (np.count_nonzero(Dataset[col].isnull()) * 100) / len(Dataset[col])
        return pd.DataFrame(adict, index=['% of missing'], columns=adict.keys())
    else:
        raise TypeError("can only be used with a pandas DataFrame")

Answer 18

In case you need to get the non-NA (non-None) and NA (None) counts across different groups pulled out by groupby:

gdf = df.groupby(['ColumnToGroupBy'])

def countna(x):
    return (x.isna()).sum()

gdf.agg(['count', countna, 'size'])

This returns the counts of non-NA, NA and total number of entries per group.
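
A hedged sketch of this on a toy frame (the column names are made up):

import numpy as np
import pandas as pd

df = pd.DataFrame({'ColumnToGroupBy': ['x', 'x', 'y'],
                   'val': [1.0, np.nan, np.nan]})

def countna(x):
    return (x.isna()).sum()

df.groupby(['ColumnToGroupBy'])['val'].agg(['count', countna, 'size'])
#                  count  countna  size
# ColumnToGroupBy
# x                    1        1     2
# y                    0        1     1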


Answer 19

Used the solution proposed by @sushmit in my code.

A possible variation of the same can also be

colNullCnt = []
for col in df1.columns:
    colNullCnt.append([col, sum(pd.isnull(df1[col]))])

The advantage of this is that it returns the result for each of the columns in the df.


Answer 20

import pandas as pd
import numpy as np

# example DataFrame
df = pd.DataFrame({'a':[1,2,np.nan], 'b':[np.nan,1,np.nan]})

# count the NaNs in a column
num_nan_a = df.loc[ (pd.isna(df['a'])) , 'a' ].shape[0]
num_nan_b = df.loc[ (pd.isna(df['b'])) , 'b' ].shape[0]

# summarize the num_nan_b
print(df)
print(' ')
print(f"There are {num_nan_a} NaNs in column a")
print(f"There are {num_nan_b} NaNs in column b")

Gives as output:

     a    b
0  1.0  NaN
1  2.0  1.0
2  NaN  NaN

There are 1 NaNs in column a
There are 2 NaNs in column b

Answer 21

Suppose you want to get the number of missing values (NaN) in a column (series) known as price in a dataframe called reviews.

#import the dataframe
import pandas as pd

reviews = pd.read_csv("../input/wine-reviews/winemag-data-130k-v2.csv", index_col=0)

To get the missing values, with n_missing_prices as the variable, simply do:

n_missing_prices = sum(reviews.price.isnull())
print(n_missing_prices)

sum is the key method here; I was trying to use count before I realized sum is the right method to use in this context.


Answer 22

https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.count.html#pandas.Series.count

pandas.Series.count
Series.count(level=None)[source]

Return number of non-NA/null observations in the Series
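
Since count excludes NA/null observations, the NaN count follows as the difference from the length; a small sketch:

import numpy as np
import pandas as pd

s = pd.Series([1, 2, np.nan])
n_nan = len(s) - s.count()  # count() skips NaN, so the difference is 1 here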


Answer 23

For your task you can use pandas.DataFrame.dropna (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dropna.html):

import pandas as pd
import numpy as np

df = pd.DataFrame({'a': [1, 2, 3, 4, np.nan],
                   'b': [1, 2, np.nan, 4, np.nan],
                   'c': [np.nan, 2, np.nan, 4, np.nan]})
df = df.dropna(axis='columns', thresh=3)

print(df)

With the thresh parameter you can require the minimum number of non-NaN values a column must have in order to be kept in the DataFrame.

Code outputs:

     a    b
0  1.0  1.0
1  2.0  2.0
2  3.0  NaN
3  4.0  4.0
4  NaN  NaN
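
Since the original question was about dropping columns past a NaN threshold, the thresh value can also be derived from a fraction; a hedged sketch in which the 40% cut-off is purely illustrative:

# keep only columns with at most 40% NaN, i.e. at least 60% non-NaN values
min_non_nan = int(len(df) * 0.6)
df = df.dropna(axis='columns', thresh=min_non_nan)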

How to replace all NaN values with zeros in a column of a pandas dataframe

Question: How to replace all NaN values with zeros in a column of a pandas dataframe

I have a dataframe as below

      itm Date                  Amount 
67    420 2012-09-30 00:00:00   65211
68    421 2012-09-09 00:00:00   29424
69    421 2012-09-16 00:00:00   29877
70    421 2012-09-23 00:00:00   30990
71    421 2012-09-30 00:00:00   61303
72    485 2012-09-09 00:00:00   71781
73    485 2012-09-16 00:00:00     NaN
74    485 2012-09-23 00:00:00   11072
75    485 2012-09-30 00:00:00  113702
76    489 2012-09-09 00:00:00   64731
77    489 2012-09-16 00:00:00     NaN

when I try to .apply a function to the Amount column I get the following error.

ValueError: cannot convert float NaN to integer

I have tried applying a function using .isnan from the math module. I have tried the pandas .replace attribute. I tried the .sparse data attribute from pandas 0.9. I have also tried an if NaN == NaN statement in a function. I have also looked at this article How do I replace NA values with zeros in an R dataframe? whilst looking at some other articles. None of the methods I have tried have worked or recognise NaN. Any hints or solutions would be appreciated.


Answer 0

I believe DataFrame.fillna() will do this for you.

Link to Docs for a dataframe and for a Series.

Example:

In [7]: df
Out[7]: 
          0         1
0       NaN       NaN
1 -0.494375  0.570994
2       NaN       NaN
3  1.876360 -0.229738
4       NaN       NaN

In [8]: df.fillna(0)
Out[8]: 
          0         1
0  0.000000  0.000000
1 -0.494375  0.570994
2  0.000000  0.000000
3  1.876360 -0.229738
4  0.000000  0.000000

To fill the NaNs in only one column, select just that column. In this case I'm using inplace=True to actually change the contents of df.

In [12]: df[1].fillna(0, inplace=True)
Out[12]: 
0    0.000000
1    0.570994
2    0.000000
3   -0.229738
4    0.000000
Name: 1

In [13]: df
Out[13]: 
          0         1
0       NaN  0.000000
1 -0.494375  0.570994
2       NaN  0.000000
3  1.876360 -0.229738
4       NaN  0.000000

EDIT:

To avoid a SettingWithCopyWarning, use the built in column-specific functionality:

df.fillna({1:0}, inplace=True)

Answer 1

It is not guaranteed that the slicing returns a view or a copy. You can do

df['column'] = df['column'].fillna(value)
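
Applied to the asker's frame, assuming the column in question is Amount, that would be:

df['Amount'] = df['Amount'].fillna(0)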

Answer 2

You could use replace to change NaN to 0:

import pandas as pd
import numpy as np

# for column
df['column'] = df['column'].replace(np.nan, 0)

# for whole dataframe
df = df.replace(np.nan, 0)

# inplace
df.replace(np.nan, 0, inplace=True)

Answer 3

I just wanted to provide a bit of an update/special case since it looks like people still come here. If you’re using a multi-index or otherwise using an index-slicer the inplace=True option may not be enough to update the slice you’ve chosen. For example in a 2×2 level multi-index this will not change any values (as of pandas 0.15):

idx = pd.IndexSlice
df.loc[idx[:,mask_1],idx[mask_2,:]].fillna(value=0,inplace=True)

The “problem” is that the chaining breaks the fillna ability to update the original dataframe. I put “problem” in quotes because there are good reasons for the design decisions that led to not interpreting through these chains in certain situations. Also, this is a complex example (though I really ran into it), but the same may apply to fewer levels of indexes depending on how you slice.

The solution is DataFrame.update:

df.update(df.loc[idx[:,mask_1],idx[[mask_2],:]].fillna(value=0))

It’s one line, reads reasonably well (sort of) and eliminates any unnecessary messing with intermediate variables or loops while allowing you to apply fillna to any multi-level slice you like!

If anybody can find places this doesn’t work please post in the comments, I’ve been messing with it and looking at the source and it seems to solve at least my multi-index slice problems.


Answer 4

The below code worked for me.

import pandas

df = pandas.read_csv('somefile.txt')

df = df.fillna(0)

Answer 5

Easy ways to fill the missing values:

filling string columns: when string columns have missing values and NaN values.

df['string column name'].fillna(df['string column name'].mode().values[0], inplace = True)

filling numeric columns: when the numeric columns have missing values and NaN values.

df['numeric column name'].fillna(df['numeric column name'].mean(), inplace = True)

filling NaN with zero:

df['column name'].fillna(0, inplace = True)

Answer 6

You can also use dictionaries to fill NaN values of specific columns in the DataFrame, rather than filling the whole DF with a single value.

import pandas as pd

df = pd.read_excel('example.xlsx')
df.fillna( {
        'column1': 'Write your values here',
        'column2': 'Write your values here',
        'column3': 'Write your values here',
        'column4': 'Write your values here',
        .
        .
        .
        'column-n': 'Write your values here'} , inplace=True)

Answer 7

(image of the question's example table omitted)

Considering that the particular column Amount in the above table is of integer type, the following would be a solution:

df['Amount'] = df.Amount.fillna(0).astype(int)

Similarly, you can fill it with various data types like float, str and so on.

In particular, I would consider the datatype when comparing various values of the same column.


Answer 8

To replace na values in pandas

df['column_name'].fillna(value_to_be_replaced,inplace=True)

If inplace=False, instead of updating the df (dataframe) in place, it will return the modified values.


Answer 9

If you were to convert it to a pandas dataframe, you can also accomplish this by using fillna.

import numpy as np
df=np.array([[1,2,3, np.nan]])

import pandas as pd
df=pd.DataFrame(df)
df.fillna(0)

This will return the following:

     0    1    2   3
0  1.0  2.0  3.0 NaN
>>> df.fillna(0)
     0    1    2    3
0  1.0  2.0  3.0  0.0

Answer 10

There are primarily two options. In the case of imputation, i.e. filling missing NaN / np.nan values with numerical replacements only (across column(s)):

df['Amount'].fillna(value=..., method=..., axis=...) is sufficient:

From the Documentation:

value : scalar, dict, Series, or DataFrame Value to use to fill holes (e.g. 0), alternately a dict/Series/DataFrame of values specifying which value to use for each index (for a Series) or column (for a DataFrame). (values not in the dict/Series/DataFrame will not be filled). This value cannot be a list.

Which means 'strings' or 'constants' are no longer permissible to be imputed.

For more specialized imputations use SimpleImputer():

import numpy as np
from sklearn.impute import SimpleImputer

si = SimpleImputer(strategy='constant', missing_values=np.nan, fill_value='Replacement_Value')
df[['Col-1', 'Col-2']] = si.fit_transform(X=df[['C-1', 'C-2']])


Answer 11

To replace nan in different columns with different ways:

   replacement= {'column_A': 0, 'column_B': -999, 'column_C': -99999}
   df.fillna(value=replacement)

How to convert the index of a pandas dataframe into a column?

Question: How to convert the index of a pandas dataframe into a column?

This seems rather obvious, but I can’t seem to figure out how to convert an index of data frame to a column?

For example:

df=
        gi       ptt_loc
 0  384444683      593  
 1  384444684      594 
 2  384444686      596  

To,

df=
    index1    gi       ptt_loc
 0  0     384444683      593  
 1  1     384444684      594 
 2  2     384444686      596  

Answer 0

Either:

df['index1'] = df.index

or use .reset_index:

df.reset_index(level=0, inplace=True)

so, if you have a multi-index frame with 3 levels of index, like:

>>> df
                       val
tick       tag obs        
2016-02-26 C   2    0.0139
2016-02-27 A   2    0.5577
2016-02-28 C   6    0.0303

and you want to convert the 1st (tick) and 3rd (obs) levels in the index into columns, you would do:

>>> df.reset_index(level=['tick', 'obs'])
          tick  obs     val
tag                        
C   2016-02-26    2  0.0139
A   2016-02-27    2  0.5577
C   2016-02-28    6  0.0303

Answer 1

For MultiIndex you can extract its subindex using

df['si_name'] = df.index.get_level_values('si_name')

where si_name is the name of the subindex.
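
A minimal sketch, with a made-up MultiIndex and level name for illustration:

import pandas as pd

idx = pd.MultiIndex.from_tuples([('a', 1), ('b', 2)], names=['si_name', 'num'])
df = pd.DataFrame({'val': [10, 20]}, index=idx)
df['si_name'] = df.index.get_level_values('si_name')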


Answer 2

To provide a bit more clarity, let’s look at a DataFrame with two levels in its index (a MultiIndex).

index = pd.MultiIndex.from_product([['TX', 'FL', 'CA'], 
                                    ['North', 'South']], 
                                   names=['State', 'Direction'])

df = pd.DataFrame(index=index, 
                  data=np.random.randint(0, 10, (6,4)), 
                  columns=list('abcd'))

(screenshot: the constructed DataFrame)

The reset_index method, called with the default parameters, converts all index levels to columns and uses a simple RangeIndex as new index.

df.reset_index()

(screenshot: output of df.reset_index())

Use the level parameter to control which index levels are converted into columns. If possible, use the level name, which is more explicit. If there are no level names, you can refer to each level by its integer location, which begin at 0 from the outside. You can use a scalar value here or a list of all the indexes you would like to reset.

df.reset_index(level='State') # same as df.reset_index(level=0)

(screenshot: output of df.reset_index(level='State'))

In the rare event that you want to preserve the index and turn the index into a column, you can do the following:

# for a single level
df.assign(State=df.index.get_level_values('State'))

# for all levels
df.assign(**df.index.to_frame())

回答 3

rename_axis + reset_index

您可以先将索引重命名为所需的标签,然后用reset_index将其提升为列:

df = df.rename_axis('index1').reset_index()

print(df)

   index1         gi  ptt_loc
0       0  384444683      593
1       1  384444684      594
2       2  384444686      596

这也适用于MultiIndex数据框:

print(df)
#                        val
# tick       tag obs        
# 2016-02-26 C   2    0.0139
# 2016-02-27 A   2    0.5577
# 2016-02-28 C   6    0.0303

df = df.rename_axis(['index1', 'index2', 'index3']).reset_index()

print(df)

       index1 index2  index3     val
0  2016-02-26      C       2  0.0139
1  2016-02-27      A       2  0.5577
2  2016-02-28      C       6  0.0303

rename_axis + reset_index

You can first rename your index to a desired label, then elevate it to a column:

df = df.rename_axis('index1').reset_index()

print(df)

   index1         gi  ptt_loc
0       0  384444683      593
1       1  384444684      594
2       2  384444686      596

This works also for MultiIndex dataframes:

print(df)
#                        val
# tick       tag obs        
# 2016-02-26 C   2    0.0139
# 2016-02-27 A   2    0.5577
# 2016-02-28 C   6    0.0303

df = df.rename_axis(['index1', 'index2', 'index3']).reset_index()

print(df)

       index1 index2  index3     val
0  2016-02-26      C       2  0.0139
1  2016-02-27      A       2  0.5577
2  2016-02-28      C       6  0.0303

回答 4

如果要使用该reset_index方法并保留现有索引,则应使用:

df.reset_index().set_index('index', drop=False)

或者就地修改:

df.reset_index(inplace=True)
df.set_index('index', drop=False, inplace=True)

例如:

print(df)
          gi  ptt_loc
0  384444683      593
4  384444684      594
9  384444686      596

print(df.reset_index())
   index         gi  ptt_loc
0      0  384444683      593
1      4  384444684      594
2      9  384444686      596

print(df.reset_index().set_index('index', drop=False))
       index         gi  ptt_loc
index
0          0  384444683      593
4          4  384444684      594
9          9  384444686      596

如果要摆脱索引标签,可以执行以下操作:

df2 = df.reset_index().set_index('index', drop=False)
df2.index.name = None
print(df2)
   index         gi  ptt_loc
0      0  384444683      593
4      4  384444684      594
9      9  384444686      596

If you want to use the reset_index method and also preserve your existing index you should use:

df.reset_index().set_index('index', drop=False)

or to change it in place:

df.reset_index(inplace=True)
df.set_index('index', drop=False, inplace=True)

For example:

print(df)
          gi  ptt_loc
0  384444683      593
4  384444684      594
9  384444686      596

print(df.reset_index())
   index         gi  ptt_loc
0      0  384444683      593
1      4  384444684      594
2      9  384444686      596

print(df.reset_index().set_index('index', drop=False))
       index         gi  ptt_loc
index
0          0  384444683      593
4          4  384444684      594
9          9  384444686      596

And if you want to get rid of the index label you can do:

df2 = df.reset_index().set_index('index', drop=False)
df2.index.name = None
print(df2)
   index         gi  ptt_loc
0      0  384444683      593
4      4  384444684      594
9      9  384444686      596

回答 5

df1 = pd.DataFrame({"gi":[232,66,34,43],"ptt":[342,56,662,123]})
p = df1.index.values
df1.insert( 0, column="new",value = p)
df1

    new     gi     ptt
0    0      232    342
1    1      66     56 
2    2      34     662
3    3      43     123
df1 = pd.DataFrame({"gi":[232,66,34,43],"ptt":[342,56,662,123]})
p = df1.index.values
df1.insert( 0, column="new",value = p)
df1

    new     gi     ptt
0    0      232    342
1    1      66     56 
2    2      34     662
3    3      43     123

回答 6

一种简单的方法是使用reset_index()方法。对于数据帧df,请使用以下代码:

df.reset_index(inplace=True)

这样索引就会变成一列;将inplace设为True会使更改永久生效。

A very simple way of doing this is to use reset_index() method.For a data frame df use the code below:

df.reset_index(inplace=True)

This way, the index will become a column, and by using inplace=True, this becomes a permanent change.
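If you want the new column to have a specific name rather than the default 'index', one option (a small sketch, my addition) is to rename it right after resetting:

import pandas as pd

df = pd.DataFrame({'gi': [384444683, 384444684]})
df.reset_index(inplace=True)
df.rename(columns={'index': 'index1'}, inplace=True)  # 'index' is the default name for an unnamed index
print(df.columns.tolist())  # ['index1', 'gi']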


从熊猫DataFrame中按部分字符串选择

问题:从熊猫DataFrame中按部分字符串选择

我有一个包含4列的DataFrame,其中2列包含字符串值。我想知道是否有一种方法可以根据针对特定列的部分字符串匹配来选择行?

换句话说,一个函数或lambda函数将执行以下操作

re.search(pattern, cell_in_question) 

返回一个布尔值。我熟悉df[df['A'] == "hello world"]这样的语法,但似乎找不到按部分字符串(例如'hello')进行匹配的方法。

有人可以指出正确的方向吗?

I have a DataFrame with 4 columns of which 2 contain string values. I was wondering if there was a way to select rows based on a partial string match against a particular column?

In other words, a function or lambda function that would do something like

re.search(pattern, cell_in_question) 

returning a boolean. I am familiar with the syntax of df[df['A'] == "hello world"] but can’t seem to find a way to do the same with a partial string match say 'hello'.

Would someone be able to point me in the right direction?


回答 0

基于github问题#620,看来您很快将能够执行以下操作:

df[df['A'].str.contains("hello")]

更新:熊猫0.8.1及更高版本中提供了矢量化字符串方法(即Series.str)

Based on github issue #620, it looks like you’ll soon be able to do the following:

df[df['A'].str.contains("hello")]

Update: vectorized string methods (i.e., Series.str) are available in pandas 0.8.1 and up.
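A minimal runnable demonstration of this approach (toy data of my own):

import pandas as pd

df = pd.DataFrame({'A': ['hello world', 'goodbye']})
print(df[df['A'].str.contains('hello')])
#              A
# 0  hello world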


回答 1

我尝试了上面提出的解决方案:

df[df["A"].str.contains("Hello|Britain")]

并得到一个错误:

ValueError:无法使用包含NA / NaN值的数组进行遮罩

您可以将NA值转换为False,如下所示:

df[df["A"].str.contains("Hello|Britain", na=False)]

I tried the proposed solution above:

df[df["A"].str.contains("Hello|Britain")]

and got an error:

ValueError: cannot mask with array containing NA / NaN values

you can transform NA values into False, like this:

df[df["A"].str.contains("Hello|Britain", na=False)]

回答 2

如何从熊猫DataFrame中按部分字符串选择?

本文面向想要完成以下操作的读者:

  • 在字符串列中搜索子字符串(最简单的情况)
  • 搜索多个子字符串(类似于isin)
  • 匹配文本中的完整单词(例如,“blue”应匹配“the sky is blue”,而不是“bluejay”)
  • 匹配多个完整词
  • 了解“ ValueError:无法使用包含NA / NaN值的向量进行索引”背后的原因

…并想进一步了解应优先采用哪种方法。

(PS:我在类似主题上看到了很多问题,我认为最好把它留在这里。)


基本子串搜索

# setup
df1 = pd.DataFrame({'col': ['foo', 'foobar', 'bar', 'baz']})
df1

      col
0     foo
1  foobar
2     bar
3     baz

str.contains可用于执行子字符串搜索或基于正则表达式的搜索。搜索默认为基于正则表达式,除非您明确禁用它。

这是一个基于正则表达式的搜索示例,

# find rows in `df1` which contain "foo" followed by something
df1[df1['col'].str.contains(r'foo(?!$)')]

      col
1  foobar

有时并不需要正则表达式搜索,此时可以指定regex=False来禁用它。

#select all rows containing "foo"
df1[df1['col'].str.contains('foo', regex=False)]
# same as df1[df1['col'].str.contains('foo')] but faster.

      col
0     foo
1  foobar

在性能方面,正则表达式搜索比子字符串搜索慢:

df2 = pd.concat([df1] * 1000, ignore_index=True)

%timeit df2[df2['col'].str.contains('foo')]
%timeit df2[df2['col'].str.contains('foo', regex=False)]

6.31 ms ± 126 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
2.8 ms ± 241 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

如果不需要,请避免使用基于正则表达式的搜索。

解决ValueError
有时,执行字符串搜索和对结果的过滤会导致

ValueError: cannot index with vector containing NA / NaN values

这通常是由于对象列中存在混合数据或NaN,

s = pd.Series(['foo', 'foobar', np.nan, 'bar', 'baz', 123])
s.str.contains('foo|bar')

0     True
1     True
2      NaN
3     True
4    False
5      NaN
dtype: object


s[s.str.contains('foo|bar')]
# ---------------------------------------------------------------------------
# ValueError                                Traceback (most recent call last)

非字符串的任何内容都不能应用字符串方法,因此结果自然是NaN。在这种情况下,请指定na=False忽略非字符串数据,

s.str.contains('foo|bar', na=False)

0     True
1     True
2    False
3     True
4    False
5    False
dtype: bool

多个子串搜索

通过使用正则表达式的OR管道(|)进行搜索,最容易实现这一点。

# Slightly modified example.
df4 = pd.DataFrame({'col': ['foo abc', 'foobar xyz', 'bar32', 'baz 45']})
df4

          col
0     foo abc
1  foobar xyz
2       bar32
3      baz 45

df4[df4['col'].str.contains(r'foo|baz')]

          col
0     foo abc
1  foobar xyz
3      baz 45

您还可以创建一个术语列表,然后将其加入:

terms = ['foo', 'baz']
df4[df4['col'].str.contains('|'.join(terms))]

          col
0     foo abc
1  foobar xyz
3      baz 45

有时,明智的做法是对术语进行转义,以防它们包含可被解释为正则表达式元字符的字符。如果您的术语包含以下任何字符…

. ^ $ * + ? { } [ ] \ | ( )

那么您就需要使用re.escape对它们进行转义:

import re
df4[df4['col'].str.contains('|'.join(map(re.escape, terms)))]

          col
0     foo abc
1  foobar xyz
3      baz 45

re.escape 具有转义特殊字符的效果,因此可以按字面意义对待它们。

re.escape(r'.foo^')
# '\\.foo\\^'

匹配全词

默认情况下,子字符串搜索将搜索指定的子字符串/模式,而不管其是否为完整单词。为了仅匹配完整的单词,我们将需要在此处使用正则表达式-特别是,我们的模式将需要指定单词边界(\b)。

例如,

df3 = pd.DataFrame({'col': ['the sky is blue', 'bluejay by the window']})
df3

                     col
0        the sky is blue
1  bluejay by the window

现在考虑

df3[df3['col'].str.contains('blue')]

                     col
0        the sky is blue
1  bluejay by the window

对比:

df3[df3['col'].str.contains(r'\bblue\b')]

               col
0  the sky is blue

多个全字搜索

与上述类似,不同之处在于我们在连接后的模式中添加了单词边界(\b)。

p = r'\b(?:{})\b'.format('|'.join(map(re.escape, terms)))
df4[df4['col'].str.contains(p)]

       col
0  foo abc
3   baz 45

其中p看起来像这样:

p
# '\\b(?:foo|baz)\\b'

一个很好的选择:使用列表推导

因为你能!而且你应该!它们通常比字符串方法快一点,因为字符串方法难以向量化并且通常具有循环实现。

代替,

df1[df1['col'].str.contains('foo', regex=False)]

在列表推导中使用in运算符:

df1[['foo' in x for x in df1['col']]]

      col
0     foo
1  foobar

代替,

regex_pattern = r'foo(?!$)'
df1[df1['col'].str.contains(regex_pattern)]

在列表推导中使用re.compile(用于缓存正则表达式)+ Pattern.search:

p = re.compile(regex_pattern, flags=re.IGNORECASE)
df1[[bool(p.search(x)) for x in df1['col']]]

      col
1  foobar

如果“col”列含有NaN,则不要使用

df1[df1['col'].str.contains(regex_pattern, na=False)]

而应使用:

def try_search(p, x):
    try:
        return bool(p.search(x))
    except TypeError:
        return False

p = re.compile(regex_pattern)
df1[[try_search(p, x) for x in df1['col']]]

      col
1  foobar

部分字符串匹配的更多选项:np.char.find、np.vectorize、DataFrame.query

除了str.contains和列表推导,您还可以使用以下替代方法。

np.char.find
仅支持子字符串搜索(即:不支持正则表达式)。

df4[np.char.find(df4['col'].values.astype(str), 'foo') > -1]

          col
0     foo abc
1  foobar xyz

np.vectorize
这是一个循环的包装器,但开销比大多数pandas str方法要小。

f = np.vectorize(lambda haystack, needle: needle in haystack)
f(df1['col'], 'foo')
# array([ True,  True, False, False])

df1[f(df1['col'], 'foo')]

      col
0     foo
1  foobar

也可以使用正则表达式方案:

regex_pattern = r'foo(?!$)'
p = re.compile(regex_pattern)
f = np.vectorize(lambda x: pd.notna(x) and bool(p.search(x)))
df1[f(df1['col'])]

      col
1  foobar

DataFrame.query
通过python引擎支持字符串方法。这没有明显的性能优势,但如果您需要动态生成查询,了解它会很有用。

df1.query('col.str.contains("foo")', engine='python')

      col
0     foo
1  foobar

有关query和eval系列方法的更多信息,请参见使用pd.eval()在pandas中进行动态表达式求值。


推荐用法

  1. (首选)str.contains,因为它简单易用,并且能处理NaN和混合数据
  2. 列表推导,因其性能优势(特别是当数据为纯字符串时)
  3. np.vectorize
  4. (最后)df.query

How do I select by partial string from a pandas DataFrame?

This post is meant for readers who want to

  • search for a substring in a string column (the simplest case)
  • search for multiple substrings (similar to isin)
  • match a whole word from text (e.g., “blue” should match “the sky is blue” but not “bluejay”)
  • match multiple whole words
  • Understand the reason behind “ValueError: cannot index with vector containing NA / NaN values”

…and would like to know more about what methods should be preferred over others.

(P.S.: I’ve seen a lot of questions on similar topics, I thought it would be good to leave this here.)


Basic Substring Search

# setup
df1 = pd.DataFrame({'col': ['foo', 'foobar', 'bar', 'baz']})
df1

      col
0     foo
1  foobar
2     bar
3     baz

str.contains can be used to perform either substring searches or regex based search. The search defaults to regex-based unless you explicitly disable it.

Here is an example of regex-based search,

# find rows in `df1` which contain "foo" followed by something
df1[df1['col'].str.contains(r'foo(?!$)')]

      col
1  foobar

Sometimes regex search is not required, so specify regex=False to disable it.

#select all rows containing "foo"
df1[df1['col'].str.contains('foo', regex=False)]
# same as df1[df1['col'].str.contains('foo')] but faster.

      col
0     foo
1  foobar

Performance wise, regex search is slower than substring search:

df2 = pd.concat([df1] * 1000, ignore_index=True)

%timeit df2[df2['col'].str.contains('foo')]
%timeit df2[df2['col'].str.contains('foo', regex=False)]

6.31 ms ± 126 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
2.8 ms ± 241 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Avoid using regex-based search if you don’t need it.

Addressing ValueErrors
Sometimes, performing a substring search and filtering on the result will result in

ValueError: cannot index with vector containing NA / NaN values

This is usually because of mixed data or NaNs in your object column,

s = pd.Series(['foo', 'foobar', np.nan, 'bar', 'baz', 123])
s.str.contains('foo|bar')

0     True
1     True
2      NaN
3     True
4    False
5      NaN
dtype: object


s[s.str.contains('foo|bar')]
# ---------------------------------------------------------------------------
# ValueError                                Traceback (most recent call last)

Anything that is not a string cannot have string methods applied on it, so the result is NaN (naturally). In this case, specify na=False to ignore non-string data,

s.str.contains('foo|bar', na=False)

0     True
1     True
2    False
3     True
4    False
5    False
dtype: bool

Multiple Substring Search

This is most easily achieved through a regex search using the regex OR pipe.

# Slightly modified example.
df4 = pd.DataFrame({'col': ['foo abc', 'foobar xyz', 'bar32', 'baz 45']})
df4

          col
0     foo abc
1  foobar xyz
2       bar32
3      baz 45

df4[df4['col'].str.contains(r'foo|baz')]

          col
0     foo abc
1  foobar xyz
3      baz 45

You can also create a list of terms, then join them:

terms = ['foo', 'baz']
df4[df4['col'].str.contains('|'.join(terms))]

          col
0     foo abc
1  foobar xyz
3      baz 45

Sometimes, it is wise to escape your terms in case they have characters that can be interpreted as regex metacharacters. If your terms contain any of the following characters…

. ^ $ * + ? { } [ ] \ | ( )

Then, you’ll need to use re.escape to escape them:

import re
df4[df4['col'].str.contains('|'.join(map(re.escape, terms)))]

          col
0     foo abc
1  foobar xyz
3      baz 45

re.escape has the effect of escaping the special characters so they’re treated literally.

re.escape(r'.foo^')
# '\\.foo\\^'

Matching Entire Word(s)

By default, the substring search searches for the specified substring/pattern regardless of whether it is full word or not. To only match full words, we will need to make use of regular expressions here—in particular, our pattern will need to specify word boundaries (\b).

For example,

df3 = pd.DataFrame({'col': ['the sky is blue', 'bluejay by the window']})
df3

                     col
0        the sky is blue
1  bluejay by the window

Now consider,

df3[df3['col'].str.contains('blue')]

                     col
0        the sky is blue
1  bluejay by the window

v/s

df3[df3['col'].str.contains(r'\bblue\b')]

               col
0  the sky is blue

Multiple Whole Word Search

Similar to the above, except we add a word boundary (\b) to the joined pattern.

p = r'\b(?:{})\b'.format('|'.join(map(re.escape, terms)))
df4[df4['col'].str.contains(p)]

       col
0  foo abc
3   baz 45

Where p looks like this,

p
# '\\b(?:foo|baz)\\b'

A Great Alternative: Use List Comprehensions!

Because you can! And you should! They are usually a little bit faster than string methods, because string methods are hard to vectorise and usually have loopy implementations.

Instead of,

df1[df1['col'].str.contains('foo', regex=False)]

Use the in operator inside a list comp,

df1[['foo' in x for x in df1['col']]]

      col
0     foo
1  foobar

Instead of,

regex_pattern = r'foo(?!$)'
df1[df1['col'].str.contains(regex_pattern)]

Use re.compile (to cache your regex) + Pattern.search inside a list comp,

p = re.compile(regex_pattern, flags=re.IGNORECASE)
df1[[bool(p.search(x)) for x in df1['col']]]

      col
1  foobar

If “col” has NaNs, then instead of

df1[df1['col'].str.contains(regex_pattern, na=False)]

Use,

def try_search(p, x):
    try:
        return bool(p.search(x))
    except TypeError:
        return False

p = re.compile(regex_pattern)
df1[[try_search(p, x) for x in df1['col']]]

      col
1  foobar

More Options for Partial String Matching: np.char.find, np.vectorize, DataFrame.query.

In addition to str.contains and list comprehensions, you can also use the following alternatives.

np.char.find
Supports substring searches (read: no regex) only.

df4[np.char.find(df4['col'].values.astype(str), 'foo') > -1]

          col
0     foo abc
1  foobar xyz

np.vectorize
This is a wrapper around a loop, but with less overhead than most pandas str methods.

f = np.vectorize(lambda haystack, needle: needle in haystack)
f(df1['col'], 'foo')
# array([ True,  True, False, False])

df1[f(df1['col'], 'foo')]

      col
0     foo
1  foobar

Regex solutions possible:

regex_pattern = r'foo(?!$)'
p = re.compile(regex_pattern)
f = np.vectorize(lambda x: pd.notna(x) and bool(p.search(x)))
df1[f(df1['col'])]

      col
1  foobar

DataFrame.query
Supports string methods through the python engine. This offers no visible performance benefits, but is nonetheless useful to know if you need to dynamically generate your queries.

df1.query('col.str.contains("foo")', engine='python')

      col
0     foo
1  foobar

More information on query and eval family of methods can be found at Dynamic Expression Evaluation in pandas using pd.eval().


Recommended Usage Precedence

  1. (First) str.contains, for its simplicity and ease handling NaNs and mixed data
  2. List comprehensions, for its performance (especially if your data is purely strings)
  3. np.vectorize
  4. (Last) df.query

回答 3

如果有人想知道如何执行相关问题:“按部分字符串选择列”

采用:

df.filter(like='hello')  # select columns which contain the word hello

要通过部分字符串匹配选择行,请传递axis=0到过滤器:

# selects rows which contain the word hello in their index label
df.filter(like='hello', axis=0)  

If anyone wonders how to perform a related problem: “Select column by partial string”

Use:

df.filter(like='hello')  # select columns which contain the word hello

And to select rows by partial string matching, pass axis=0 to filter:

# selects rows which contain the word hello in their index label
df.filter(like='hello', axis=0)  
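A small self-contained sketch of both variants (invented labels):

import pandas as pd

df = pd.DataFrame({'hello_col': [1, 2], 'other': [3, 4]},
                  index=['say hello', 'bye'])
print(df.filter(like='hello').columns.tolist())        # ['hello_col']
print(df.filter(like='hello', axis=0).index.tolist())  # ['say hello']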

回答 4

快速说明:如果要基于索引中包含的部分字符串进行选择,请尝试以下操作:

df['stridx']=df.index
df[df['stridx'].str.contains("Hello|Britain")]

Quick note: if you want to do selection based on a partial string contained in the index, try the following:

df['stridx']=df.index
df[df['stridx'].str.contains("Hello|Britain")]
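Alternatively (my note, assuming a string index), the helper column can be avoided because Index also exposes the .str accessor:

import pandas as pd

df = pd.DataFrame({'val': [1, 2]}, index=['Hello world', 'other'])
print(df[df.index.str.contains('Hello|Britain')])
#              val
# Hello world    1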

回答 5

假设您有以下DataFrame:

>>> df = pd.DataFrame([['hello', 'hello world'], ['abcd', 'defg']], columns=['a','b'])
>>> df
       a            b
0  hello  hello world
1   abcd         defg

您始终可以在lambda表达式中使用in运算符来创建过滤器。

>>> df.apply(lambda x: x['a'] in x['b'], axis=1)
0     True
1    False
dtype: bool

这里的技巧是在apply中使用axis=1选项,将元素逐行(而不是逐列)传递给lambda函数。

Say you have the following DataFrame:

>>> df = pd.DataFrame([['hello', 'hello world'], ['abcd', 'defg']], columns=['a','b'])
>>> df
       a            b
0  hello  hello world
1   abcd         defg

You can always use the in operator in a lambda expression to create your filter.

>>> df.apply(lambda x: x['a'] in x['b'], axis=1)
0     True
1    False
dtype: bool

The trick here is to use the axis=1 option in the apply to pass elements to the lambda function row by row, as opposed to column by column.


回答 6

这就是我为部分字符串匹配所做的最终结果。如果有人有更有效的方法,请告诉我。

import re
from pandas import DataFrame, concat

def stringSearchColumn_DataFrame(df, colName, regex):
    newdf = DataFrame()
    # .iteritems() 在较新版本的pandas中已更名为 .items()
    for idx, record in df[colName].iteritems():
        if re.search(regex, record):
            newdf = concat([df[df[colName] == record], newdf], ignore_index=True)
    return newdf

Here’s what I ended up doing for partial string matches. If anyone has a more efficient way of doing this please let me know.

import re
from pandas import DataFrame, concat

def stringSearchColumn_DataFrame(df, colName, regex):
    newdf = DataFrame()
    # .iteritems() was renamed to .items() in newer pandas versions
    for idx, record in df[colName].iteritems():
        if re.search(regex, record):
            newdf = concat([df[df[colName] == record], newdf], ignore_index=True)
    return newdf
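For reference, a vectorized equivalent of the loop above (my suggestion, not part of the original answer):

import pandas as pd

df = pd.DataFrame({'A': ['hello world', 'bye']})
# boolean masking with str.contains replaces the row-by-row re.search loop
newdf = df[df['A'].str.contains(r'hello')].reset_index(drop=True)
print(newdf)
#              A
# 0  hello world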

回答 7

对于包含特殊字符的字符串,使用contains效果不佳。不过find可以。

df[df['A'].str.find("hello") != -1]

Using contains didn’t work well for my string with special characters. Find worked though.

df[df['A'].str.find("hello") != -1]
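Another option for literal strings containing regex metacharacters (my addition): keep using contains but disable regex interpretation:

import pandas as pd

df = pd.DataFrame({'A': ['hello (special) case', 'plain']})
# regex=False treats the pattern as a literal substring
print(df[df['A'].str.contains('hello (special)', regex=False)])
#                       A
# 0  hello (special) case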

回答 8

在此之前已有一些答案实现了所要求的功能,不过我想展示一种最通用的方式:

df.filter(regex=".*STRING_YOU_LOOK_FOR.*")

这样,无论列名是如何写的,您都可以获取要查找的列。

(显然,您必须为每种情况编写正确的正则表达式)

There are answers before this which accomplish the asked feature, anyway I would like to show the most generally way:

df.filter(regex=".*STRING_YOU_LOOK_FOR.*")

This way lets you get the columns you are looking for, however they are written.

(Obviously, you have to write the proper regex expression for each case.)


回答 9

也许您想在Pandas数据框的所有列中搜索一些文本,而不仅仅是在它们的子集中。在这种情况下,以下代码将有所帮助。

df[df.apply(lambda row: row.astype(str).str.contains('String To Find').any(), axis=1)]

警告。此方法相对较慢,但很方便。

Maybe you want to search for some text in all columns of the Pandas dataframe, and not just in the subset of them. In this case, the following code will help.

df[df.apply(lambda row: row.astype(str).str.contains('String To Find').any(), axis=1)]

Warning. This method is relatively slow, albeit convenient.


回答 10

如果您需要在pandas dataframe列中进行不区分大小写的搜索,请执行以下操作:

df[df['A'].str.contains("hello", case=False)]

Should you need to do a case insensitive search for a string in a pandas dataframe column:

df[df['A'].str.contains("hello", case=False)]

使用pandas GroupBy获取每个组的统计信息(例如计数,均值等)?

问题:使用pandas GroupBy获取每个组的统计信息(例如计数,均值等)?

我有一个数据框df,并使用其中的几列进行groupby:

df['col1','col2','col3','col4'].groupby(['col1','col2']).mean()

通过以上方法,我几乎得到了所需的表(数据框)。缺少的是另外一列,其中包含每个组中的行数。换句话说,我得到了均值,但我也想知道这些均值是由多少个数计算得出的。例如,第一组中有8个值,第二组中有10个值,依此类推。

简而言之:如何获取数据框的分组统计信息?

I have a data frame df and I use several columns from it to groupby:

df['col1','col2','col3','col4'].groupby(['col1','col2']).mean()

In the above way I almost get the table (data frame) that I need. What is missing is an additional column that contains number of rows in each group. In other words, I have mean but I also would like to know how many number were used to get these means. For example in the first group there are 8 values and in the second one 10 and so on.

In short: How do I get group-wise statistics for a dataframe?


回答 0

在groupby对象上,agg函数可以接受一个列表,从而一次应用多种聚合方法。这应该能给出您需要的结果:

df[['col1', 'col2', 'col3', 'col4']].groupby(['col1', 'col2']).agg(['mean', 'count'])

On groupby object, the agg function can take a list to apply several aggregation methods at once. This should give you the result you need:

df[['col1', 'col2', 'col3', 'col4']].groupby(['col1', 'col2']).agg(['mean', 'count'])

回答 1

快速回答:

获取每个组的行数的最简单方法是调用.size(),它返回一个Series

df.groupby(['col1','col2']).size()


通常,您希望此结果为DataFrame(而不是Series),因此您可以执行以下操作:

df.groupby(['col1', 'col2']).size().reset_index(name='counts')


如果您想了解如何计算每组的行数和其他统计信息,请继续阅读下面的内容。


详细的例子:

考虑以下示例数据框:

In [2]: df
Out[2]: 
  col1 col2  col3  col4  col5  col6
0    A    B  0.20 -0.61 -0.49  1.49
1    A    B -1.53 -1.01 -0.39  1.82
2    A    B -0.44  0.27  0.72  0.11
3    A    B  0.28 -1.32  0.38  0.18
4    C    D  0.12  0.59  0.81  0.66
5    C    D -0.13 -1.65 -1.64  0.50
6    C    D -1.42 -0.11 -0.18 -0.44
7    E    F -0.00  1.42 -0.26  1.17
8    E    F  0.91 -0.47  1.35 -0.34
9    G    H  1.48 -0.63 -1.14  0.17

首先让我们用.size()来获取行数:

In [3]: df.groupby(['col1', 'col2']).size()
Out[3]: 
col1  col2
A     B       4
C     D       3
E     F       2
G     H       1
dtype: int64

然后让我们使用.size().reset_index(name='counts')来获取行数:

In [4]: df.groupby(['col1', 'col2']).size().reset_index(name='counts')
Out[4]: 
  col1 col2  counts
0    A    B       4
1    C    D       3
2    E    F       2
3    G    H       1


包括结果以获取更多统计信息

当您要计算分组数据的统计信息时,通常如下所示:

In [5]: (df
   ...: .groupby(['col1', 'col2'])
   ...: .agg({
   ...:     'col3': ['mean', 'count'], 
   ...:     'col4': ['median', 'min', 'count']
   ...: }))
Out[5]: 
            col4                  col3      
          median   min count      mean count
col1 col2                                   
A    B    -0.810 -1.32     4 -0.372500     4
C    D    -0.110 -1.65     3 -0.476667     3
E    F     0.475 -0.47     2  0.455000     2
G    H    -0.630 -0.63     1  1.480000     1

由于列标签是嵌套的,而且行计数是按列给出的,上面的结果处理起来有点麻烦。

为了更好地控制输出,我通常将统计信息拆分为单独的聚合,然后用join将它们合并。看起来像这样:

In [6]: gb = df.groupby(['col1', 'col2'])
   ...: counts = gb.size().to_frame(name='counts')
   ...: (counts
   ...:  .join(gb.agg({'col3': 'mean'}).rename(columns={'col3': 'col3_mean'}))
   ...:  .join(gb.agg({'col4': 'median'}).rename(columns={'col4': 'col4_median'}))
   ...:  .join(gb.agg({'col4': 'min'}).rename(columns={'col4': 'col4_min'}))
   ...:  .reset_index()
   ...: )
   ...: 
Out[6]: 
  col1 col2  counts  col3_mean  col4_median  col4_min
0    A    B       4  -0.372500       -0.810     -1.32
1    C    D       3  -0.476667       -0.110     -1.65
2    E    F       2   0.455000        0.475     -0.47
3    G    H       1   1.480000       -0.630     -0.63



脚注

下面显示了用于生成测试数据的代码:

In [1]: import numpy as np
   ...: import pandas as pd 
   ...: 
   ...: keys = np.array([
   ...:         ['A', 'B'],
   ...:         ['A', 'B'],
   ...:         ['A', 'B'],
   ...:         ['A', 'B'],
   ...:         ['C', 'D'],
   ...:         ['C', 'D'],
   ...:         ['C', 'D'],
   ...:         ['E', 'F'],
   ...:         ['E', 'F'],
   ...:         ['G', 'H'] 
   ...:         ])
   ...: 
   ...: df = pd.DataFrame(
   ...:     np.hstack([keys,np.random.randn(10,4).round(2)]), 
   ...:     columns = ['col1', 'col2', 'col3', 'col4', 'col5', 'col6']
   ...: )
   ...: 
   ...: df[['col3', 'col4', 'col5', 'col6']] = \
   ...:     df[['col3', 'col4', 'col5', 'col6']].astype(float)
   ...: 


免责声明:

如果您要聚合的某些列含有空值,那么您确实应该把每组的行计数当作每列各自独立的聚合来看待。否则您可能会误判实际上有多少条记录参与了均值等计算,因为pandas在计算均值时会丢弃NaN条目而不会提示您。

Quick Answer:

The simplest way to get row counts per group is by calling .size(), which returns a Series:

df.groupby(['col1','col2']).size()


Usually you want this result as a DataFrame (instead of a Series) so you can do:

df.groupby(['col1', 'col2']).size().reset_index(name='counts')


If you want to find out how to calculate the row counts and other statistics for each group continue reading below.


Detailed example:

Consider the following example dataframe:

In [2]: df
Out[2]: 
  col1 col2  col3  col4  col5  col6
0    A    B  0.20 -0.61 -0.49  1.49
1    A    B -1.53 -1.01 -0.39  1.82
2    A    B -0.44  0.27  0.72  0.11
3    A    B  0.28 -1.32  0.38  0.18
4    C    D  0.12  0.59  0.81  0.66
5    C    D -0.13 -1.65 -1.64  0.50
6    C    D -1.42 -0.11 -0.18 -0.44
7    E    F -0.00  1.42 -0.26  1.17
8    E    F  0.91 -0.47  1.35 -0.34
9    G    H  1.48 -0.63 -1.14  0.17

First let’s use .size() to get the row counts:

In [3]: df.groupby(['col1', 'col2']).size()
Out[3]: 
col1  col2
A     B       4
C     D       3
E     F       2
G     H       1
dtype: int64

Then let’s use .size().reset_index(name='counts') to get the row counts:

In [4]: df.groupby(['col1', 'col2']).size().reset_index(name='counts')
Out[4]: 
  col1 col2  counts
0    A    B       4
1    C    D       3
2    E    F       2
3    G    H       1


Including results for more statistics

When you want to calculate statistics on grouped data, it usually looks like this:

In [5]: (df
   ...: .groupby(['col1', 'col2'])
   ...: .agg({
   ...:     'col3': ['mean', 'count'], 
   ...:     'col4': ['median', 'min', 'count']
   ...: }))
Out[5]: 
            col4                  col3      
          median   min count      mean count
col1 col2                                   
A    B    -0.810 -1.32     4 -0.372500     4
C    D    -0.110 -1.65     3 -0.476667     3
E    F     0.475 -0.47     2  0.455000     2
G    H    -0.630 -0.63     1  1.480000     1

The result above is a little annoying to deal with because of the nested column labels, and also because row counts are on a per column basis.

To gain more control over the output I usually split the statistics into individual aggregations that I then combine using join. It looks like this:

In [6]: gb = df.groupby(['col1', 'col2'])
   ...: counts = gb.size().to_frame(name='counts')
   ...: (counts
   ...:  .join(gb.agg({'col3': 'mean'}).rename(columns={'col3': 'col3_mean'}))
   ...:  .join(gb.agg({'col4': 'median'}).rename(columns={'col4': 'col4_median'}))
   ...:  .join(gb.agg({'col4': 'min'}).rename(columns={'col4': 'col4_min'}))
   ...:  .reset_index()
   ...: )
   ...: 
Out[6]: 
  col1 col2  counts  col3_mean  col4_median  col4_min
0    A    B       4  -0.372500       -0.810     -1.32
1    C    D       3  -0.476667       -0.110     -1.65
2    E    F       2   0.455000        0.475     -0.47
3    G    H       1   1.480000       -0.630     -0.63



Footnotes

The code used to generate the test data is shown below:

In [1]: import numpy as np
   ...: import pandas as pd 
   ...: 
   ...: keys = np.array([
   ...:         ['A', 'B'],
   ...:         ['A', 'B'],
   ...:         ['A', 'B'],
   ...:         ['A', 'B'],
   ...:         ['C', 'D'],
   ...:         ['C', 'D'],
   ...:         ['C', 'D'],
   ...:         ['E', 'F'],
   ...:         ['E', 'F'],
   ...:         ['G', 'H'] 
   ...:         ])
   ...: 
   ...: df = pd.DataFrame(
   ...:     np.hstack([keys,np.random.randn(10,4).round(2)]), 
   ...:     columns = ['col1', 'col2', 'col3', 'col4', 'col5', 'col6']
   ...: )
   ...: 
   ...: df[['col3', 'col4', 'col5', 'col6']] = \
   ...:     df[['col3', 'col4', 'col5', 'col6']].astype(float)
   ...: 


Disclaimer:

If some of the columns that you are aggregating have null values, then you really want to be looking at the group row counts as an independent aggregation for each column. Otherwise you may be misled as to how many records are actually being used to calculate things like the mean because pandas will drop NaN entries in the mean calculation without telling you about it.


回答 2

一个函数搞定一切:GroupBy.describe

返回每个组的count、mean、std以及其他有用的统计信息。

df.groupby(['col1', 'col2'])[['col3', 'col4']].describe()

# Setup
np.random.seed(0)
df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar',
                          'foo', 'bar', 'foo', 'foo'],
                   'B' : ['one', 'one', 'two', 'three',
                          'two', 'two', 'one', 'three'],
                   'C' : np.random.randn(8),
                   'D' : np.random.randn(8)})

from IPython.display import display

with pd.option_context('precision', 2):
    display(df.groupby(['A', 'B'])['C'].describe())

           count  mean   std   min   25%   50%   75%   max
A   B                                                     
bar one      1.0  0.40   NaN  0.40  0.40  0.40  0.40  0.40
    three    1.0  2.24   NaN  2.24  2.24  2.24  2.24  2.24
    two      1.0 -0.98   NaN -0.98 -0.98 -0.98 -0.98 -0.98
foo one      2.0  1.36  0.58  0.95  1.15  1.36  1.56  1.76
    three    1.0 -0.15   NaN -0.15 -0.15 -0.15 -0.15 -0.15
    two      2.0  1.42  0.63  0.98  1.20  1.42  1.65  1.87

要获取特定的统计信息,只需选择它们,

df.groupby(['A', 'B'])['C'].describe()[['count', 'mean']]

           count      mean
A   B                     
bar one      1.0  0.400157
    three    1.0  2.240893
    two      1.0 -0.977278
foo one      2.0  1.357070
    three    1.0 -0.151357
    two      2.0  1.423148

describe适用于多列(将['C']改为['C', 'D'],或将其完全删除,看看会发生什么:结果是一个具有MultiIndex列的数据框)。

对于字符串数据,您还会得到不同的统计信息。下面是一个例子:

df2 = df.assign(D=list('aaabbccc')).sample(n=100, replace=True)

with pd.option_context('precision', 2):
    display(df2.groupby(['A', 'B'])
               .describe(include='all')
               .dropna(how='all', axis=1))

              C                                                   D                
          count  mean       std   min   25%   50%   75%   max count unique top freq
A   B                                                                              
bar one    14.0  0.40  5.76e-17  0.40  0.40  0.40  0.40  0.40    14      1   a   14
    three  14.0  2.24  4.61e-16  2.24  2.24  2.24  2.24  2.24    14      1   b   14
    two     9.0 -0.98  0.00e+00 -0.98 -0.98 -0.98 -0.98 -0.98     9      1   c    9
foo one    22.0  1.43  4.10e-01  0.95  0.95  1.76  1.76  1.76    22      2   a   13
    three  15.0 -0.15  0.00e+00 -0.15 -0.15 -0.15 -0.15 -0.15    15      1   c   15
    two    26.0  1.49  4.48e-01  0.98  0.98  1.87  1.87  1.87    26      2   b   15

有关更多信息,请参见文档

One Function to Rule Them All: GroupBy.describe

Returns count, mean, std, and other useful statistics per-group.

df.groupby(['col1', 'col2'])[['col3', 'col4']].describe()

# Setup
np.random.seed(0)
df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar',
                          'foo', 'bar', 'foo', 'foo'],
                   'B' : ['one', 'one', 'two', 'three',
                          'two', 'two', 'one', 'three'],
                   'C' : np.random.randn(8),
                   'D' : np.random.randn(8)})

from IPython.display import display

with pd.option_context('precision', 2):
    display(df.groupby(['A', 'B'])['C'].describe())

           count  mean   std   min   25%   50%   75%   max
A   B                                                     
bar one      1.0  0.40   NaN  0.40  0.40  0.40  0.40  0.40
    three    1.0  2.24   NaN  2.24  2.24  2.24  2.24  2.24
    two      1.0 -0.98   NaN -0.98 -0.98 -0.98 -0.98 -0.98
foo one      2.0  1.36  0.58  0.95  1.15  1.36  1.56  1.76
    three    1.0 -0.15   NaN -0.15 -0.15 -0.15 -0.15 -0.15
    two      2.0  1.42  0.63  0.98  1.20  1.42  1.65  1.87

To get specific statistics, just select them,

df.groupby(['A', 'B'])['C'].describe()[['count', 'mean']]

           count      mean
A   B                     
bar one      1.0  0.400157
    three    1.0  2.240893
    two      1.0 -0.977278
foo one      2.0  1.357070
    three    1.0 -0.151357
    two      2.0  1.423148

describe works for multiple columns (change ['C'] to ['C', 'D']—or remove it altogether—and see what happens, the result is a MultiIndexed columned dataframe).

You also get different statistics for string data. Here’s an example,

df2 = df.assign(D=list('aaabbccc')).sample(n=100, replace=True)

with pd.option_context('precision', 2):
    display(df2.groupby(['A', 'B'])
               .describe(include='all')
               .dropna(how='all', axis=1))

              C                                                   D                
          count  mean       std   min   25%   50%   75%   max count unique top freq
A   B                                                                              
bar one    14.0  0.40  5.76e-17  0.40  0.40  0.40  0.40  0.40    14      1   a   14
    three  14.0  2.24  4.61e-16  2.24  2.24  2.24  2.24  2.24    14      1   b   14
    two     9.0 -0.98  0.00e+00 -0.98 -0.98 -0.98 -0.98 -0.98     9      1   c    9
foo one    22.0  1.43  4.10e-01  0.95  0.95  1.76  1.76  1.76    22      2   a   13
    three  15.0 -0.15  0.00e+00 -0.15 -0.15 -0.15 -0.15 -0.15    15      1   c   15
    two    26.0  1.49  4.48e-01  0.98  0.98  1.87  1.87  1.87    26      2   b   15

For more information, see the documentation.


回答 3

我们可以使用groupby和count轻松地做到这一点。但是,我们应该记住使用reset_index()。

df[['col1','col2','col3','col4']].groupby(['col1','col2']).count().\
reset_index()

We can easily do it by using groupby and count. But, we should remember to use reset_index().

df[['col1','col2','col3','col4']].groupby(['col1','col2']).count().\
reset_index()
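One caveat worth noting (my addition): count() counts non-null values per column, while size() counts all rows in the group, NaNs included:

import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': ['A', 'A'], 'col2': ['B', 'B'], 'col3': [1.0, np.nan]})
print(df.groupby(['col1', 'col2'])['col3'].count())  # A/B -> 1 (NaN excluded)
print(df.groupby(['col1', 'col2']).size())           # A/B -> 2 (all rows counted)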

回答 4

要获取多个统计信息,请折叠索引并保留列名:

df = df.groupby(['col1','col2']).agg(['mean', 'count'])
df.columns = [ ' '.join(str(i) for i in col) for col in df.columns]
df.reset_index(inplace=True)
df

生成:

(截图:列名已扁平化的DataFrame输出)

To get multiple stats, collapse the index, and retain column names:

df = df.groupby(['col1','col2']).agg(['mean', 'count'])
df.columns = [ ' '.join(str(i) for i in col) for col in df.columns]
df.reset_index(inplace=True)
df

Produces:

(screenshot: DataFrame output with flattened column names)
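Since the original screenshot is not available, here is a minimal sketch (made-up data) of the flattened column names this produces:

import pandas as pd

df = pd.DataFrame({'col1': ['A', 'A'], 'col2': ['B', 'B'], 'col3': [1.0, 3.0]})
out = df.groupby(['col1', 'col2']).agg(['mean', 'count'])
out.columns = [' '.join(str(i) for i in col) for col in out.columns]
out.reset_index(inplace=True)
print(out.columns.tolist())  # ['col1', 'col2', 'col3 mean', 'col3 count']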


回答 5

创建一个组对象并调用如下示例所示的方法:

grp = df.groupby(['col1',  'col2',  'col3']) 

grp.max() 
grp.mean() 
grp.describe() 

Create a group object and call methods like below example:

grp = df.groupby(['col1',  'col2',  'col3']) 

grp.max() 
grp.mean() 
grp.describe() 

回答 6

请尝试此代码

new_column = df.groupby(['col1', 'col2'])['col3'].transform('count')
df['count_it'] = new_column
df

该代码会添加一个名为“count_it”的列,其中包含每行所属分组的计数(transform返回与原索引对齐的Series,因此可以直接赋给新列)。

Please try this code

new_column = df.groupby(['col1', 'col2'])['col3'].transform('count')
df['count_it'] = new_column
df

This code adds a column called 'count_it' holding the count of each row's group (transform returns a Series aligned with the original index, so it can be assigned directly).


随机打乱DataFrame的行

问题:随机打乱DataFrame的行

我有以下DataFrame:

    Col1  Col2  Col3  Type
0      1     2     3     1
1      4     5     6     1
...
20     7     8     9     2
21    10    11    12     2
...
45    13    14    15     3
46    16    17    18     3
...

该DataFrame从csv文件读取。所有Type为1的行在最上面,然后是Type为2的行,接着是Type为3的行,依此类推。

我想打乱DataFrame行的顺序,使所有Type混合在一起。可能的结果如下:

    Col1  Col2  Col3  Type
0      7     8     9     2
1     13    14    15     3
...
20     1     2     3     1
21    10    11    12     2
...
45     4     5     6     1
46    16    17    18     3
...

我该如何实现?

I have the following DataFrame:

    Col1  Col2  Col3  Type
0      1     2     3     1
1      4     5     6     1
...
20     7     8     9     2
21    10    11    12     2
...
45    13    14    15     3
46    16    17    18     3
...

The DataFrame is read from a csv file. All rows which have Type 1 are on top, followed by the rows with Type 2, followed by the rows with Type 3, etc.

I would like to shuffle the order of the DataFrame’s rows, so that all Type‘s are mixed. A possible result could be:

    Col1  Col2  Col3  Type
0      7     8     9     2
1     13    14    15     3
...
20     1     2     3     1
21    10    11    12     2
...
45     4     5     6     1
46    16    17    18     3
...

How can I achieve this?


回答 0

使用Pandas的惯用方式是使用数据框的.sample方法对所有行进行无放回采样:

df.sample(frac=1)

frac关键字参数指定随机样本中要返回的行的比例,因此frac=1表示返回所有行(顺序随机)。


注意: 如果您希望就地改组数据帧并重置索引,则可以执行例如

df = df.sample(frac=1).reset_index(drop=True)

在此,指定drop=True可防止.reset_index创建包含旧索引条目的列。

后续注解:尽管上面的操作看起来不是就地进行的,但python/pandas足够聪明,不会为打乱后的对象再做一次malloc。也就是说,即使引用对象已改变(即id(df_old)与id(df_new)不同),底层C对象仍然相同。为了证明确实如此,您可以运行一个简单的内存分析器:

$ python3 -m memory_profiler .\test.py
Filename: .\test.py

Line #    Mem usage    Increment   Line Contents
================================================
     5     68.5 MiB     68.5 MiB   @profile
     6                             def shuffle():
     7    847.8 MiB    779.3 MiB       df = pd.DataFrame(np.random.randn(100, 1000000))
     8    847.9 MiB      0.1 MiB       df = df.sample(frac=1).reset_index(drop=True)

The idiomatic way to do this with Pandas is to use the .sample method of your dataframe to sample all rows without replacement:

df.sample(frac=1)

The frac keyword argument specifies the fraction of rows to return in the random sample, so frac=1 means return all rows (in random order).


Note: If you wish to shuffle your dataframe in-place and reset the index, you could do e.g.

df = df.sample(frac=1).reset_index(drop=True)

Here, specifying drop=True prevents .reset_index from creating a column containing the old index entries.

Follow-up note: Although it may not look like the above operation is in-place, python/pandas is smart enough not to do another malloc for the shuffled object. That is, even though the reference object has changed (by which I mean id(df_old) is not the same as id(df_new)), the underlying C object is still the same. To show that this is indeed the case, you could run a simple memory profiler:

$ python3 -m memory_profiler .\test.py
Filename: .\test.py

Line #    Mem usage    Increment   Line Contents
================================================
     5     68.5 MiB     68.5 MiB   @profile
     6                             def shuffle():
     7    847.8 MiB    779.3 MiB       df = pd.DataFrame(np.random.randn(100, 1000000))
     8    847.9 MiB      0.1 MiB       df = df.sample(frac=1).reset_index(drop=True)


回答 1

您可以为此简单地使用sklearn

from sklearn.utils import shuffle
df = shuffle(df)

You can simply use sklearn for this

from sklearn.utils import shuffle
df = shuffle(df)
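sklearn.utils.shuffle also accepts a random_state for reproducibility (a short sketch; note the index travels with the rows):

from sklearn.utils import shuffle
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3]})
df = shuffle(df, random_state=0)  # same permutation every run with the same seed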

回答 2

您可以通过使用改组后的索引建立索引来改组数据帧的行。为此,您可以使用np.random.permutation(但np.random.choice也可以):

In [12]: df = pd.read_csv(StringIO(s), sep="\s+")

In [13]: df
Out[13]: 
    Col1  Col2  Col3  Type
0      1     2     3     1
1      4     5     6     1
20     7     8     9     2
21    10    11    12     2
45    13    14    15     3
46    16    17    18     3

In [14]: df.iloc[np.random.permutation(len(df))]
Out[14]: 
    Col1  Col2  Col3  Type
46    16    17    18     3
45    13    14    15     3
20     7     8     9     2
0      1     2     3     1
1      4     5     6     1
21    10    11    12     2

如果要像示例中那样将索引的编号始终保持为1、2,..,n,则只需重置索引即可: df_shuffled.reset_index(drop=True)

You can shuffle the rows of a dataframe by indexing with a shuffled index. For this, you can eg use np.random.permutation (but np.random.choice is also a possibility):

In [12]: df = pd.read_csv(StringIO(s), sep="\s+")

In [13]: df
Out[13]: 
    Col1  Col2  Col3  Type
0      1     2     3     1
1      4     5     6     1
20     7     8     9     2
21    10    11    12     2
45    13    14    15     3
46    16    17    18     3

In [14]: df.iloc[np.random.permutation(len(df))]
Out[14]: 
    Col1  Col2  Col3  Type
46    16    17    18     3
45    13    14    15     3
20     7     8     9     2
0      1     2     3     1
1      4     5     6     1
21    10    11    12     2

If you want to keep the index numbered from 1, 2, .., n as in your example, you can simply reset the index: df_shuffled.reset_index(drop=True)


回答 3

TL;DR:np.random.shuffle(ndarray)可以胜任。
所以,在你的情况下

np.random.shuffle(DataFrame.values)

在底层,DataFrame使用NumPy ndarray作为数据容器。(您可以查看DataFrame源代码确认)

因此,如果使用np.random.shuffle(),它会沿多维数组的第一个轴就地打乱数组,但DataFrame的索引仍保持原样(未被打乱)。

虽然,有一些要考虑的问题。

  • 该函数不返回任何内容(就地打乱)。如果要保留原始对象的副本,必须在传入该函数之前自行复制。
  • 正如用户tj89所建议的,sklearn.utils.shuffle()可以指定random_state及其他选项来控制输出,开发调试时可能会用到。
  • sklearn.utils.shuffle()更快,但它会连同所含的ndarray一起打乱DataFrame的轴信息(索引、列)。

基准结果

sklearn.utils.shuffle()与np.random.shuffle()之间的对比。

ndarray

nd = sklearn.utils.shuffle(nd)

0.10793248389381915秒 快8倍

np.random.shuffle(nd)

0.8897626010002568秒

数据框

df = sklearn.utils.shuffle(df)

0.3183923360193148秒 快3倍

np.random.shuffle(df.values)

0.9357550159329548秒

结论:如果可以将轴信息(索引,列)与ndarray一起改组,请使用sklearn.utils.shuffle()。否则,使用np.random.shuffle()

使用的代码

import timeit
setup = '''
import numpy as np
import pandas as pd
import sklearn
nd = np.random.random((1000, 100))
df = pd.DataFrame(nd)
'''

timeit.timeit('nd = sklearn.utils.shuffle(nd)', setup=setup, number=1000)
timeit.timeit('np.random.shuffle(nd)', setup=setup, number=1000)
timeit.timeit('df = sklearn.utils.shuffle(df)', setup=setup, number=1000)
timeit.timeit('np.random.shuffle(df.values)', setup=setup, number=1000)

TL;DR: np.random.shuffle(ndarray) can do the job.
So, in your case

np.random.shuffle(DataFrame.values)

DataFrame, under the hood, uses NumPy ndarray as data holder. (You can check from DataFrame source code)

So if you use np.random.shuffle(), it shuffles the array along the first axis of a multi-dimensional array. But the index of the DataFrame remains unshuffled.

Though, there are some points to consider.

  • The function returns None; it shuffles in place. In case you want to keep a copy of the original object, you have to make one before you pass it to the function.
  • sklearn.utils.shuffle(), as user tj89 suggested, can designate random_state along with other options to control the output. You may want that for dev purposes.
  • sklearn.utils.shuffle() is faster. But it WILL SHUFFLE the axis info (index, columns) of the DataFrame along with the ndarray it contains.

Benchmark result

between sklearn.utils.shuffle() and np.random.shuffle().

ndarray

nd = sklearn.utils.shuffle(nd)

0.10793248389381915 sec. 8x faster

np.random.shuffle(nd)

0.8897626010002568 sec

DataFrame

df = sklearn.utils.shuffle(df)

0.3183923360193148 sec. 3x faster

np.random.shuffle(df.values)

0.9357550159329548 sec

Conclusion: If it is okay for the axis info (index, columns) to be shuffled along with the ndarray, use sklearn.utils.shuffle(). Otherwise, use np.random.shuffle().

used code

import timeit
setup = '''
import numpy as np
import pandas as pd
import sklearn
nd = np.random.random((1000, 100))
df = pd.DataFrame(nd)
'''

timeit.timeit('nd = sklearn.utils.shuffle(nd)', setup=setup, number=1000)
timeit.timeit('np.random.shuffle(nd)', setup=setup, number=1000)
timeit.timeit('df = sklearn.utils.shuffle(df)', setup=setup, number=1000)
timeit.timeit('np.random.shuffle(df.values)', setup=setup, number=1000)


回答 4

(我没有足够的声誉在置顶回答下发表评论,所以希望其他人可以替我这样做。)有人担心第一种方法:

df.sample(frac=1)

究竟是进行了深拷贝,还是只是就地更改了数据框。我运行了以下代码:

print(hex(id(df)))
print(hex(id(df.sample(frac=1))))
print(hex(id(df.sample(frac=1).reset_index(drop=True))))

我的结果是:

0x1f8a784d400
0x1f8b9d65e10
0x1f8b9d65b70

这意味着该方法并没有像上面评论中暗示的那样返回同一个对象。因此,此方法的确会生成一个打乱顺序的副本。

(I don’t have enough reputation to comment this on the top post, so I hope someone else can do that for me.) There was a concern raised that the first method:

df.sample(frac=1)

made a deep copy or just changed the dataframe. I ran the following code:

print(hex(id(df)))
print(hex(id(df.sample(frac=1))))
print(hex(id(df.sample(frac=1).reset_index(drop=True))))

and my results were:

0x1f8a784d400
0x1f8b9d65e10
0x1f8b9d65b70

which means the method is not returning the same object, as was suggested in the last comment. So this method does indeed make a shuffled copy.


回答 5

另外一点很有用:如果将其用于机器学习,并且希望每次都划分出相同的数据,则可以使用:

df.sample(n=len(df), random_state=42)

这样可以确保您的随机选择始终可复现。

What is also useful: if you use it for machine learning and always want to separate the same data, you could use:

df.sample(n=len(df), random_state=42)

this makes sure that your random choice is always replicable.


回答 6

AFAIK最简单的解决方案是:

df_shuffled = df.reindex(np.random.permutation(df.index))

AFAIK the simplest solution is:

df_shuffled = df.reindex(np.random.permutation(df.index))

回答 7

打乱数据帧的方法:取一个样本数组(这里是索引),将其顺序随机打乱,然后把该数组设为数据帧的索引,再按索引对数据帧排序。这样就得到了打乱后的数据帧:

import random
df = pd.DataFrame({"a":[1,2,3,4],"b":[5,6,7,8]})
index = [i for i in range(df.shape[0])]
random.shuffle(index)
df.set_index([index]).sort_index()

输出

    a   b
0   2   6
1   1   5
2   3   7
3   4   8

将您自己的数据帧替换上面代码中的示例数据帧即可。

Shuffle the pandas data frame by taking a sample array (in this case, the index), randomizing its order, then setting the array as the index of the data frame. Now sort the data frame according to the index. Here is your shuffled dataframe:

import random
df = pd.DataFrame({"a":[1,2,3,4],"b":[5,6,7,8]})
index = [i for i in range(df.shape[0])]
random.shuffle(index)
df.set_index([index]).sort_index()

output

    a   b
0   2   6
1   1   5
2   3   7
3   4   8

Insert your data frame in place of mine in the code above.


回答 8

这是另一种方式:

df['rnd'] = np.random.rand(len(df))
df = df.sort_values(by='rnd').drop('rnd', axis=1)

Here is another way:

df['rnd'] = np.random.rand(len(df))
df = df.sort_values(by='rnd').drop('rnd', axis=1)