Tag Archives: pandas

Deleting DataFrame rows in Pandas based on column value

Question: Deleting DataFrame rows in Pandas based on column value


I have the following DataFrame:

             daysago  line_race rating        rw    wrating
 line_date                                                 
 2007-03-31       62         11     56  1.000000  56.000000
 2007-03-10       83         11     67  1.000000  67.000000
 2007-02-10      111          9     66  1.000000  66.000000
 2007-01-13      139         10     83  0.880678  73.096278
 2006-12-23      160         10     88  0.793033  69.786942
 2006-11-09      204          9     52  0.636655  33.106077
 2006-10-22      222          8     66  0.581946  38.408408
 2006-09-29      245          9     70  0.518825  36.317752
 2006-09-16      258         11     68  0.486226  33.063381
 2006-08-30      275          8     72  0.446667  32.160051
 2006-02-11      475          5     65  0.164591  10.698423
 2006-01-13      504          0     70  0.142409   9.968634
 2006-01-02      515          0     64  0.134800   8.627219
 2005-12-06      542          0     70  0.117803   8.246238
 2005-11-29      549          0     70  0.113758   7.963072
 2005-11-22      556          0     -1  0.109852  -0.109852
 2005-11-01      577          0     -1  0.098919  -0.098919
 2005-10-20      589          0     -1  0.093168  -0.093168
 2005-09-27      612          0     -1  0.083063  -0.083063
 2005-09-07      632          0     -1  0.075171  -0.075171
 2005-06-12      719          0     69  0.048690   3.359623
 2005-05-29      733          0     -1  0.045404  -0.045404
 2005-05-02      760          0     -1  0.039679  -0.039679
 2005-04-02      790          0     -1  0.034160  -0.034160
 2005-03-13      810          0     -1  0.030915  -0.030915
 2004-11-09      934          0     -1  0.016647  -0.016647

I need to remove the rows where line_race is equal to 0. What’s the most efficient way to do this?


Answer 0


If I’m understanding correctly, it should be as simple as:

df = df[df.line_race != 0]

Answer 1


But for any future passers-by, it is worth mentioning that df = df[df.line_race != 0] doesn’t do anything when trying to filter for None/missing values.

Does work:

df = df[df.line_race != 0]

Doesn’t do anything:

df = df[df.line_race != None]

Does work:

df = df[df.line_race.notnull()]
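
As a hedged aside (not part of the original answer), here is a minimal sketch of why the != comparison fails for missing values: pandas stores them as NaN, and NaN compares unequal to everything, so an element-wise comparison against None never flags them. The column name follows the question; the data is invented for the demo.

import numpy as np
import pandas as pd

df = pd.DataFrame({"line_race": [11.0, np.nan, 0.0]})

# The != None comparison is True for every element, so nothing is filtered out
print(df[df.line_race != None])

# notnull() inspects missingness directly and drops the NaN row
print(df[df.line_race.notnull()])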

Answer 2


The best way to do this is with boolean masking:

In [56]: df
Out[56]:
     line_date  daysago  line_race  rating    raw  wrating
0   2007-03-31       62         11      56  1.000   56.000
1   2007-03-10       83         11      67  1.000   67.000
2   2007-02-10      111          9      66  1.000   66.000
3   2007-01-13      139         10      83  0.881   73.096
4   2006-12-23      160         10      88  0.793   69.787
5   2006-11-09      204          9      52  0.637   33.106
6   2006-10-22      222          8      66  0.582   38.408
7   2006-09-29      245          9      70  0.519   36.318
8   2006-09-16      258         11      68  0.486   33.063
9   2006-08-30      275          8      72  0.447   32.160
10  2006-02-11      475          5      65  0.165   10.698
11  2006-01-13      504          0      70  0.142    9.969
12  2006-01-02      515          0      64  0.135    8.627
13  2005-12-06      542          0      70  0.118    8.246
14  2005-11-29      549          0      70  0.114    7.963
15  2005-11-22      556          0      -1  0.110   -0.110
16  2005-11-01      577          0      -1  0.099   -0.099
17  2005-10-20      589          0      -1  0.093   -0.093
18  2005-09-27      612          0      -1  0.083   -0.083
19  2005-09-07      632          0      -1  0.075   -0.075
20  2005-06-12      719          0      69  0.049    3.360
21  2005-05-29      733          0      -1  0.045   -0.045
22  2005-05-02      760          0      -1  0.040   -0.040
23  2005-04-02      790          0      -1  0.034   -0.034
24  2005-03-13      810          0      -1  0.031   -0.031
25  2004-11-09      934          0      -1  0.017   -0.017

In [57]: df[df.line_race != 0]
Out[57]:
     line_date  daysago  line_race  rating    raw  wrating
0   2007-03-31       62         11      56  1.000   56.000
1   2007-03-10       83         11      67  1.000   67.000
2   2007-02-10      111          9      66  1.000   66.000
3   2007-01-13      139         10      83  0.881   73.096
4   2006-12-23      160         10      88  0.793   69.787
5   2006-11-09      204          9      52  0.637   33.106
6   2006-10-22      222          8      66  0.582   38.408
7   2006-09-29      245          9      70  0.519   36.318
8   2006-09-16      258         11      68  0.486   33.063
9   2006-08-30      275          8      72  0.447   32.160
10  2006-02-11      475          5      65  0.165   10.698

UPDATE: Now that pandas 0.13 is out, another way to do this is df.query('line_race != 0').
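
For illustration, a minimal, hedged sketch of the query approach (the column name follows the question; the data is invented):

import pandas as pd

df = pd.DataFrame({"line_race": [11, 0, 9, 0], "rating": [56, 70, 66, -1]})

# query evaluates the boolean expression against the column names
print(df.query('line_race != 0'))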


Answer 3


Just to add another solution, particularly useful if you are using the new pandas accessors: other solutions will replace the original DataFrame and lose the accessors.

df.drop(df.loc[df['line_race']==0].index, inplace=True)

Answer 4


If you want to delete rows based on multiple values of the column, you could use:

df[(df.line_race != 0) & (df.line_race != 10)]

To drop all rows with values 0 and 10 for line_race.
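
As a hedged aside (not part of the original answer), the same multi-value filter can be written more compactly with Series.isin:

import pandas as pd

df = pd.DataFrame({"line_race": [11, 0, 10, 9]})

# ~ negates the membership test, dropping rows whose value is 0 or 10
df = df[~df.line_race.isin([0, 10])]
print(df)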


Answer 5


The given answer is correct; nonetheless, as someone above said, you can use df.query('line_race != 0'), which depending on your problem can be much faster. Highly recommended.


Answer 6


Though the previous answers are almost similar to what I am going to do, using the index directly does not require the extra .loc[] indexing step. It can be done in a similar but more concise manner:

df.drop(df.index[df['line_race'] == 0], inplace = True)

Answer 7


Another way of doing it. It may not be the most efficient way, as the code looks a bit more complex than the code mentioned in other answers, but it is still an alternative way of doing the same thing.

  df = df.drop(df[df['line_race']==0].index)

Answer 8


Just adding another way, applied across all columns of the DataFrame:

for column in df.columns:
    df = df[df[column] != 0]

Example:

import numpy as np

def z_score(data, count):
    """Drop rows whose value in any column is an outlier (|z| > 3)."""
    threshold = 3
    for column in data.columns:
        mean = np.mean(data[column])
        std = np.std(data[column])
        for i in data[column]:
            zscore = (i - mean) / std
            if np.abs(zscore) > threshold:
                count = count + 1
                data = data[data[column] != i]
    return data, count
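
A hypothetical usage sketch (the toy DataFrame and the starting count of 0 are assumptions, not part of the original answer):

import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3, 100], "b": [4, 5, 6, 7]})

# Here no |z| exceeds 3, so nothing is dropped and the count stays 0
cleaned, n_outliers = z_score(df, 0)
print(cleaned)
print(n_outliers)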

Converting a Pandas GroupBy output from Series to DataFrame

Question: Converting a Pandas GroupBy output from Series to DataFrame


I’m starting with input data like this:

df1 = pandas.DataFrame( { 
    "Name" : ["Alice", "Bob", "Mallory", "Mallory", "Bob" , "Mallory"] , 
    "City" : ["Seattle", "Seattle", "Portland", "Seattle", "Seattle", "Portland"] } )

Which when printed appears as this:

   City     Name
0   Seattle    Alice
1   Seattle      Bob
2  Portland  Mallory
3   Seattle  Mallory
4   Seattle      Bob
5  Portland  Mallory

Grouping is simple enough:

g1 = df1.groupby( [ "Name", "City"] ).count()

and printing yields a GroupBy object:

                  City  Name
Name    City
Alice   Seattle      1     1
Bob     Seattle      2     2
Mallory Portland     2     2
        Seattle      1     1

But what I want eventually is another DataFrame object that contains all the rows in the GroupBy object. In other words I want to get the following result:

                  City  Name
Name    City
Alice   Seattle      1     1
Bob     Seattle      2     2
Mallory Portland     2     2
Mallory Seattle      1     1

I can’t quite see how to accomplish this in the pandas documentation. Any hints would be welcome.


Answer 0


g1 here is a DataFrame. It has a hierarchical index, though:

In [19]: type(g1)
Out[19]: pandas.core.frame.DataFrame

In [20]: g1.index
Out[20]: 
MultiIndex([('Alice', 'Seattle'), ('Bob', 'Seattle'), ('Mallory', 'Portland'),
       ('Mallory', 'Seattle')], dtype=object)

Perhaps you want something like this?

In [21]: g1.add_suffix('_Count').reset_index()
Out[21]: 
      Name      City  City_Count  Name_Count
0    Alice   Seattle           1           1
1      Bob   Seattle           2           2
2  Mallory  Portland           2           2
3  Mallory   Seattle           1           1

Or something like:

In [36]: DataFrame({'count' : df1.groupby( [ "Name", "City"] ).size()}).reset_index()
Out[36]: 
      Name      City  count
0    Alice   Seattle      1
1      Bob   Seattle      2
2  Mallory  Portland      2
3  Mallory   Seattle      1

Answer 1


I want to slightly change the answer given by Wes, because version 0.16.2 requires as_index=False. If you don’t set it, you get an empty dataframe.

Source:

Aggregation functions will not return the groups that you are aggregating over if they are named columns, when as_index=True, the default. The grouped columns will be the indices of the returned object.

Passing as_index=False will return the groups that you are aggregating over, if they are named columns.

Aggregating functions are ones that reduce the dimension of the returned objects, for example: mean, sum, size, count, std, var, sem, describe, first, last, nth, min, max. This is what happens when you do for example DataFrame.sum() and get back a Series.

nth can act as a reducer or a filter, see here.

import pandas as pd

df1 = pd.DataFrame({"Name":["Alice", "Bob", "Mallory", "Mallory", "Bob" , "Mallory"],
                    "City":["Seattle","Seattle","Portland","Seattle","Seattle","Portland"]})
print(df1)
#
#       City     Name
#0   Seattle    Alice
#1   Seattle      Bob
#2  Portland  Mallory
#3   Seattle  Mallory
#4   Seattle      Bob
#5  Portland  Mallory
#
g1 = df1.groupby(["Name", "City"], as_index=False).count()
print(g1)
#
#                  City  Name
#Name    City
#Alice   Seattle      1     1
#Bob     Seattle      2     2
#Mallory Portland     2     2
#        Seattle      1     1
#

EDIT:

In version 0.17.1 and later you can use a column subset with count, and use reset_index with the name parameter together with size:

print(df1.groupby(["Name", "City"], as_index=False).count())
#IndexError: list index out of range

print(df1.groupby(["Name", "City"]).count())
#Empty DataFrame
#Columns: []
#Index: [(Alice, Seattle), (Bob, Seattle), (Mallory, Portland), (Mallory, Seattle)]

print(df1.groupby(["Name", "City"])[['Name','City']].count())
#                  Name  City
#Name    City                
#Alice   Seattle      1     1
#Bob     Seattle      2     2
#Mallory Portland     2     2
#        Seattle      1     1

print(df1.groupby(["Name", "City"]).size().reset_index(name='count'))
#      Name      City  count
#0    Alice   Seattle      1
#1      Bob   Seattle      2
#2  Mallory  Portland      2
#3  Mallory   Seattle      1

The difference between count and size is that size counts NaN values while count does not.
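
A minimal, hedged illustration of that difference (the toy data is an assumption):

import numpy as np
import pandas as pd

df = pd.DataFrame({"g": ["a", "a", "b"], "v": [1.0, np.nan, 2.0]})

print(df.groupby("g").size())        # a -> 2, b -> 1 (the NaN row is counted)
print(df.groupby("g")["v"].count())  # a -> 1, b -> 1 (the NaN row is excluded)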


Answer 2


Simply, this should do the task:

import pandas as pd

grouped_df = df1.groupby( [ "Name", "City"] )

pd.DataFrame(grouped_df.size().reset_index(name = "Group_Count"))

Here, grouped_df.size() pulls up the unique groupby counts, and the reset_index() method sets the name of the new column to what you want it to be. Finally, the pandas DataFrame() constructor is called to create a DataFrame object.


Answer 3


The key is to use the reset_index() method.

Use:

import pandas

df1 = pandas.DataFrame( { 
    "Name" : ["Alice", "Bob", "Mallory", "Mallory", "Bob" , "Mallory"] , 
    "City" : ["Seattle", "Seattle", "Portland", "Seattle", "Seattle", "Portland"] } )

g1 = df1.groupby( [ "Name", "City"] ).count().reset_index()

Now you have your new dataframe in g1:

result dataframe


Answer 4


Maybe I misunderstand the question but if you want to convert the groupby back to a dataframe you can use .to_frame(). I wanted to reset the index when I did this so I included that part as well.

Example code unrelated to the question:

df = df['TIME'].groupby(df['Name']).min()
df = df.to_frame()
df = df.reset_index(level=['Name',"TIME"])

Answer 5


I found this worked for me.

import numpy as np
import pandas as pd

df1 = pd.DataFrame({ 
    "Name" : ["Alice", "Bob", "Mallory", "Mallory", "Bob" , "Mallory"] , 
    "City" : ["Seattle", "Seattle", "Portland", "Seattle", "Seattle", "Portland"]})

df1['City_count'] = 1
df1['Name_count'] = 1

df1.groupby(['Name', 'City'], as_index=False).count()

Answer 6


The solution below may be simpler:

df1.reset_index().groupby( [ "Name", "City"],as_index=False ).count()

Answer 7


I have aggregated the data Qty-wise and stored it to a dataframe:

almo_grp_data = pd.DataFrame({'Qty_cnt' :
almo_slt_models_data.groupby( ['orderDate','Item','State Abv']
          )['Qty'].sum()}).reset_index()

Answer 8


These solutions only partially worked for me because I was doing multiple aggregations. Here is a sample output of my groupby that I wanted to convert to a dataframe:

Groupby Output

Because I wanted more than the count provided by reset_index(), I wrote a manual method for converting the image above into a dataframe. I understand this is not the most pythonic/pandas way of doing this as it is quite verbose and explicit, but it was all I needed. Basically, use the reset_index() method explained above to start a “scaffolding” dataframe, then loop through the group pairings in the grouped dataframe, retrieve the indices, perform your calculations against the ungrouped dataframe, and set the value in your new aggregated dataframe.

df_grouped = df[['Salary Basis', 'Job Title', 'Hourly Rate', 'Male Count', 'Female Count']]
df_grouped = df_grouped.groupby(['Salary Basis', 'Job Title'], as_index=False)

# Grouped gives us the indices we want for each grouping
# We cannot convert a groupedby object back to a dataframe, so we need to do it manually
# Create a new dataframe to work against
df_aggregated = df_grouped.size().to_frame('Total Count').reset_index()
df_aggregated['Male Count'] = 0
df_aggregated['Female Count'] = 0
df_aggregated['Job Rate'] = 0

def manualAggregations(indices_array):
    temp_df = df.iloc[indices_array]
    return {
        'Male Count': temp_df['Male Count'].sum(),
        'Female Count': temp_df['Female Count'].sum(),
        'Job Rate': temp_df['Hourly Rate'].max()
    }

for name, group in df_grouped:
    ix = df_grouped.indices[name]
    calcDict = manualAggregations(ix)

    for key in calcDict:
        #Salary Basis, Job Title
        columns = list(name)
        df_aggregated.loc[(df_aggregated['Salary Basis'] == columns[0]) & 
                          (df_aggregated['Job Title'] == columns[1]), key] = calcDict[key]

If a dictionary isn’t your thing, the calculations could be applied inline in the for loop:

    df_aggregated['Male Count'].loc[(df_aggregated['Salary Basis'] == columns[0]) & 
                                (df_aggregated['Job Title'] == columns[1])] = df['Male Count'].iloc[ix].sum()

Combine two columns of text in a dataframe in pandas/python

Question: Combine two columns of text in a dataframe in pandas/python


I have a 20 x 4000 dataframe in Python using pandas. Two of these columns are named Year and quarter. I’d like to create a variable called period that combines Year = 2000 and quarter = q2 into 2000q2.

Can anyone help with that?


Answer 0


If both columns are strings, you can concatenate them directly:

df["period"] = df["Year"] + df["quarter"]

If one (or both) of the columns is not string typed, you should convert it (them) first:

df["period"] = df["Year"].astype(str) + df["quarter"]

Beware of NaNs when doing this!


If you need to join multiple string columns, you can use agg:

df['period'] = df[['Year', 'quarter', ...]].agg('-'.join, axis=1)

Where “-” is the separator.
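
A hedged sketch of the NaN caveat: concatenating a column that contains NaN propagates NaN into the result, so you may want to fill missing values first (the fillna approach and the toy data are assumptions, not part of the original answer):

import numpy as np
import pandas as pd

df = pd.DataFrame({"Year": ["2014", "2015"], "quarter": ["q1", np.nan]})

print(df["Year"] + df["quarter"])             # second row becomes NaN
print(df["Year"] + df["quarter"].fillna(""))  # second row becomes "2015"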


Answer 1

df = pd.DataFrame({'Year': ['2014', '2015'], 'quarter': ['q1', 'q2']})
df['period'] = df[['Year', 'quarter']].apply(lambda x: ''.join(x), axis=1)

产生此数据框

   Year quarter  period
0  2014      q1  2014q1
1  2015      q2  2015q2

此方法通过替换df[['Year', 'quarter']]为数据框的任何列切片(例如)将其推广为任意数量的字符串列df.iloc[:,0:2].apply(lambda x: ''.join(x), axis=1)

您可以在此处查看有关apply()方法的更多信息

df = pd.DataFrame({'Year': ['2014', '2015'], 'quarter': ['q1', 'q2']})
df['period'] = df[['Year', 'quarter']].apply(lambda x: ''.join(x), axis=1)

Yields this dataframe

   Year quarter  period
0  2014      q1  2014q1
1  2015      q2  2015q2

This method generalizes to an arbitrary number of string columns by replacing df[['Year', 'quarter']] with any column slice of your dataframe, e.g. df.iloc[:,0:2].apply(lambda x: ''.join(x), axis=1).

You can find more information about the apply() method here.


Answer 2


Small data sets (< 150 rows)

[''.join(i) for i in zip(df["Year"].map(str),df["quarter"])]

or slightly slower but more compact:

df.Year.str.cat(df.quarter)

Larger data sets (> 150 rows)

df['Year'].astype(str) + df['quarter']

UPDATE: Timing graph Pandas 0.23.4


Let’s test it on 200K rows DF:

In [250]: df
Out[250]:
   Year quarter
0  2014      q1
1  2015      q2

In [251]: df = pd.concat([df] * 10**5)

In [252]: df.shape
Out[252]: (200000, 2)

UPDATE: new timings using Pandas 0.19.0

Timing without CPU/GPU optimization (sorted from fastest to slowest):

In [107]: %timeit df['Year'].astype(str) + df['quarter']
10 loops, best of 3: 131 ms per loop

In [106]: %timeit df['Year'].map(str) + df['quarter']
10 loops, best of 3: 161 ms per loop

In [108]: %timeit df.Year.str.cat(df.quarter)
10 loops, best of 3: 189 ms per loop

In [109]: %timeit df.loc[:, ['Year','quarter']].astype(str).sum(axis=1)
1 loop, best of 3: 567 ms per loop

In [110]: %timeit df[['Year','quarter']].astype(str).sum(axis=1)
1 loop, best of 3: 584 ms per loop

In [111]: %timeit df[['Year','quarter']].apply(lambda x : '{}{}'.format(x[0],x[1]), axis=1)
1 loop, best of 3: 24.7 s per loop

Timing using CPU/GPU optimization:

In [113]: %timeit df['Year'].astype(str) + df['quarter']
10 loops, best of 3: 53.3 ms per loop

In [114]: %timeit df['Year'].map(str) + df['quarter']
10 loops, best of 3: 65.5 ms per loop

In [115]: %timeit df.Year.str.cat(df.quarter)
10 loops, best of 3: 79.9 ms per loop

In [116]: %timeit df.loc[:, ['Year','quarter']].astype(str).sum(axis=1)
1 loop, best of 3: 230 ms per loop

In [117]: %timeit df[['Year','quarter']].astype(str).sum(axis=1)
1 loop, best of 3: 230 ms per loop

In [118]: %timeit df[['Year','quarter']].apply(lambda x : '{}{}'.format(x[0],x[1]), axis=1)
1 loop, best of 3: 9.38 s per loop

Answer contribution by @anton-vbr


Answer 3


The method cat() of the .str accessor works really well for this:

>>> import pandas as pd
>>> df = pd.DataFrame([["2014", "q1"], 
...                    ["2015", "q3"]],
...                   columns=('Year', 'Quarter'))
>>> print(df)
   Year Quarter
0  2014      q1
1  2015      q3
>>> df['Period'] = df.Year.str.cat(df.Quarter)
>>> print(df)
   Year Quarter  Period
0  2014      q1  2014q1
1  2015      q3  2015q3

cat() even allows you to add a separator so, for example, suppose you only have integers for year and period, you can do this:

>>> import pandas as pd
>>> df = pd.DataFrame([[2014, 1],
...                    [2015, 3]],
...                   columns=('Year', 'Quarter'))
>>> print(df)
   Year Quarter
0  2014       1
1  2015       3
>>> df['Period'] = df.Year.astype(str).str.cat(df.Quarter.astype(str), sep='q')
>>> print(df)
   Year Quarter  Period
0  2014       1  2014q1
1  2015       3  2015q3

Joining multiple columns is just a matter of passing either a list of series or a dataframe containing all but the first column as a parameter to str.cat() invoked on the first column (Series):

>>> df = pd.DataFrame(
...     [['USA', 'Nevada', 'Las Vegas'],
...      ['Brazil', 'Pernambuco', 'Recife']],
...     columns=['Country', 'State', 'City'],
... )
>>> df['AllTogether'] = df['Country'].str.cat(df[['State', 'City']], sep=' - ')
>>> print(df)
  Country       State       City                   AllTogether
0     USA      Nevada  Las Vegas      USA - Nevada - Las Vegas
1  Brazil  Pernambuco     Recife  Brazil - Pernambuco - Recife

Do note that if your pandas dataframe/series has null values, you need to include the parameter na_rep to replace the NaN values with a string, otherwise the combined column will default to NaN.
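
A short, hedged sketch of the na_rep parameter (the toy data is an assumption):

import numpy as np
import pandas as pd

s = pd.Series(["2014", "2015"])
t = pd.Series(["q1", np.nan])

print(s.str.cat(t))              # second element becomes NaN
print(s.str.cat(t, na_rep="?"))  # second element becomes "2015?"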


Answer 4


Use of a lambda function this time, with string.format().

import pandas as pd
df = pd.DataFrame({'Year': ['2014', '2015'], 'Quarter': ['q1', 'q2']})
print(df)
df['YearQuarter'] = df[['Year','Quarter']].apply(lambda x : '{}{}'.format(x[0],x[1]), axis=1)
print(df)

  Quarter  Year
0      q1  2014
1      q2  2015
  Quarter  Year YearQuarter
0      q1  2014      2014q1
1      q2  2015      2015q2

This allows you to work with non-strings and reformat values as needed.

import pandas as pd
df = pd.DataFrame({'Year': ['2014', '2015'], 'Quarter': [1, 2]})
print(df.dtypes)
print(df)

df['YearQuarter'] = df[['Year','Quarter']].apply(lambda x : '{}q{}'.format(x[0],x[1]), axis=1)
print(df)

Quarter     int64
Year       object
dtype: object
   Quarter  Year
0        1  2014
1        2  2015
   Quarter  Year YearQuarter
0        1  2014      2014q1
1        2  2015      2015q2

Answer 5


Simple answer for your question.

    year    quarter
0   2000    q1
1   2000    q2

> df['year_quarter'] = df['year'] + '' + df['quarter']

> print(df['year_quarter'])
  2000q1
  2000q2

Answer 6


Although @silvado’s answer is good, if you change df.map(str) to df.astype(str) it will be faster:

import pandas as pd
df = pd.DataFrame({'Year': ['2014', '2015'], 'quarter': ['q1', 'q2']})

In [131]: %timeit df["Year"].map(str)
10000 loops, best of 3: 132 us per loop

In [132]: %timeit df["Year"].astype(str)
10000 loops, best of 3: 82.2 us per loop

Answer 7


Let us suppose your dataframe is df with columns Year and Quarter.

import pandas as pd
df = pd.DataFrame({'Quarter':'q1 q2 q3 q4'.split(), 'Year':'2000'})

Suppose we want to see the dataframe:

df
>>>  Quarter    Year
   0    q1      2000
   1    q2      2000
   2    q3      2000
   3    q4      2000

Finally, concatenate the Year and the Quarter as follows.

df['Period'] = df['Year'] + ' ' + df['Quarter']

You can now print df to see the resulting dataframe.

df
>>>  Quarter    Year    Period
    0   q1      2000    2000 q1
    1   q2      2000    2000 q2
    2   q3      2000    2000 q3
    3   q4      2000    2000 q4

If you do not want the space between the year and quarter, simply remove it by doing:

df['Period'] = df['Year'] + df['Quarter']

Answer 8


Here is an implementation that I find very versatile:

In [1]: import pandas as pd 

In [2]: df = pd.DataFrame([[0, 'the', 'quick', 'brown'],
   ...:                    [1, 'fox', 'jumps', 'over'], 
   ...:                    [2, 'the', 'lazy', 'dog']],
   ...:                   columns=['c0', 'c1', 'c2', 'c3'])

In [3]: def str_join(df, sep, *cols):
   ...:     from functools import reduce
   ...:     return reduce(lambda x, y: x.astype(str).str.cat(y.astype(str), sep=sep), 
   ...:                   [df[col] for col in cols])
   ...: 

In [4]: df['cat'] = str_join(df, '-', 'c0', 'c1', 'c2', 'c3')

In [5]: df
Out[5]: 
   c0   c1     c2     c3                cat
0   0  the  quick  brown  0-the-quick-brown
1   1  fox  jumps   over   1-fox-jumps-over
2   2  the   lazy    dog     2-the-lazy-dog

Answer 9


Assuming your data is inserted into a dataframe, this command should solve your problem:

df['period'] = df[['Year', 'quarter']].apply(lambda x: ' '.join(x.astype(str)), axis=1)

Answer 10


More efficient is:

def concat_df_str1(df):
    """ run time: 1.3416s """
    return pd.Series([''.join(row.astype(str)) for row in df.values], index=df.index)

and here is a time test:

import numpy as np
import pandas as pd

from time import time


def concat_df_str1(df):
    """ run time: 1.3416s """
    return pd.Series([''.join(row.astype(str)) for row in df.values], index=df.index)


def concat_df_str2(df):
    """ run time: 5.2758s """
    return df.astype(str).sum(axis=1)


def concat_df_str3(df):
    """ run time: 5.0076s """
    df = df.astype(str)
    return df[0] + df[1] + df[2] + df[3] + df[4] + \
           df[5] + df[6] + df[7] + df[8] + df[9]


def concat_df_str4(df):
    """ run time: 7.8624s """
    return df.astype(str).apply(lambda x: ''.join(x), axis=1)


def main():
    df = pd.DataFrame(np.zeros(1000000).reshape(100000, 10))
    df = df.astype(int)

    time1 = time()
    df_en = concat_df_str4(df)
    print('run time: %.4fs' % (time() - time1))
    print(df_en.head(10))


if __name__ == '__main__':
    main()

Finally, note that when sum (concat_df_str2) is used, the result is not always a simple concatenation; the values may be converted to integers.


Answer 11


Generalising to multiple columns, why not:

columns = ['whatever', 'columns', 'you', 'choose']
df['period'] = df[columns].astype(str).sum(axis=1)

Answer 12


Using zip could be even quicker:

df["period"] = [''.join(i) for i in zip(df["Year"].map(str),df["quarter"])]

Graph:


import pandas as pd
import numpy as np
import timeit
import matplotlib.pyplot as plt
from collections import defaultdict

df = pd.DataFrame({'Year': ['2014', '2015'], 'quarter': ['q1', 'q2']})

myfuncs = {
"df['Year'].astype(str) + df['quarter']":
    lambda: df['Year'].astype(str) + df['quarter'],
"df['Year'].map(str) + df['quarter']":
    lambda: df['Year'].map(str) + df['quarter'],
"df.Year.str.cat(df.quarter)":
    lambda: df.Year.str.cat(df.quarter),
"df.loc[:, ['Year','quarter']].astype(str).sum(axis=1)":
    lambda: df.loc[:, ['Year','quarter']].astype(str).sum(axis=1),
"df[['Year','quarter']].astype(str).sum(axis=1)":
    lambda: df[['Year','quarter']].astype(str).sum(axis=1),
    "df[['Year','quarter']].apply(lambda x : '{}{}'.format(x[0],x[1]), axis=1)":
    lambda: df[['Year','quarter']].apply(lambda x : '{}{}'.format(x[0],x[1]), axis=1),
    "[''.join(i) for i in zip(dataframe['Year'].map(str),dataframe['quarter'])]":
    lambda: [''.join(i) for i in zip(df["Year"].map(str),df["quarter"])]
}

d = defaultdict(dict)
step = 10
cont = True
while cont:
    lendf = len(df); print(lendf)
    for k,v in myfuncs.items():
        iters = 1
        t = 0
        while t < 0.2:
            ts = timeit.repeat(v, number=iters, repeat=3)
            t = min(ts)
            iters *= 10
        d[k][lendf] = t/iters
        if t > 2: cont = False
    df = pd.concat([df]*step)

pd.DataFrame(d).plot().legend(loc='upper center', bbox_to_anchor=(0.5, -0.15))
plt.yscale('log'); plt.xscale('log'); plt.ylabel('seconds'); plt.xlabel('df rows')
plt.show()

Answer 13


Simplest Solution:

Generic Solution

df['combined_col'] = df[['col1', 'col2']].astype(str).apply('-'.join, axis=1)

Question specific solution

df['quarter_year'] = df[['quarter', 'year']].astype(str).apply(''.join, axis=1)

Specify the preferred delimiter inside the quotes before .join.


Answer 14


This solution uses an intermediate step that compresses two columns of the DataFrame into a single column containing a list of the values. This works not only for strings but for all kinds of column dtypes.

import pandas as pd
df = pd.DataFrame({'Year': ['2014', '2015'], 'quarter': ['q1', 'q2']})
df['list']=df[['Year','quarter']].values.tolist()
df['period']=df['list'].apply(''.join)
print(df)

Result:

   Year quarter        list  period
0  2014      q1  [2014, q1]  2014q1
1  2015      q2  [2015, q2]  2015q2

Answer 15


As many have mentioned previously, you must convert each column to string and then use the plus operator to combine two string columns. You can get a large performance improvement by using NumPy.

%timeit df['Year'].values.astype(str) + df.quarter
71.1 ms ± 3.76 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit df['Year'].astype(str) + df['quarter']
565 ms ± 22.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Answer 16


I think the best way to combine the columns in pandas is by converting both the columns to integer and then to str.

df[['Year', 'quarter']] = df[['Year', 'quarter']].astype(int).astype(str)
df['Period']= df['Year'] + 'q' + df['quarter']

Answer 17


Here is my summary of the above solutions for concatenating/combining two columns (with int and str values) into a new column, using a separator between the column values. All three solutions work for this purpose.

# be cautious about the separator, some symbols may cause "SyntaxError: EOL while scanning string literal".
# e.g. ";;" as separator would raise the SyntaxError

separator = "&&" 

# pd.Series.str.cat() method does not work to concatenate / combine two columns with int value and str value. This would raise "AttributeError: Can only use .cat accessor with a 'category' dtype"

df["period"] = df["Year"].map(str) + separator + df["quarter"]
df["period"] = df[['Year','quarter']].apply(lambda x : '{} && {}'.format(x[0],x[1]), axis=1)
df["period"] = df.apply(lambda x: f'{x["Year"]} && {x["quarter"]}', axis=1)

Answer 18


Use .combine_first.

df['Period'] = df['Year'].combine_first(df['Quarter'])

Note, though, that combine_first fills missing values in one Series with values from the other; it does not concatenate strings, so this only applies when one of the columns may be null.
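
A hedged illustration of what combine_first actually does (the toy data is an assumption, not part of the original answer):

import numpy as np
import pandas as pd

year = pd.Series(["2014", np.nan])
quarter = pd.Series(["q1", "q2"])

# Takes year where it is present and falls back to quarter where year is NaN
print(year.combine_first(quarter))  # 0 -> "2014", 1 -> "q2"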

Answer 19

from functools import reduce

import numpy as np

def madd(x):
    """Performs element-wise string concatenation with multiple input arrays.

    Args:
        x: iterable of np.array.

    Returns: np.array.
    """
    for i, arr in enumerate(x):
        if type(arr.item(0)) is not str:
            x[i] = x[i].astype(str)
    return reduce(np.core.defchararray.add, x)

For example:

import pandas as pd

data = list(zip([2000]*4, ['q1', 'q2', 'q3', 'q4']))
df = pd.DataFrame(data=data, columns=['Year', 'quarter'])
df['period'] = madd([df[col].values for col in ['Year', 'quarter']])

df

    Year    quarter period
0   2000    q1  2000q1
1   2000    q2  2000q2
2   2000    q3  2000q3
3   2000    q4  2000q4

Answer 20


One can use the assign method of DataFrame:

df= (pd.DataFrame({'Year': ['2014', '2015'], 'quarter': ['q1', 'q2']}).
  assign(period=lambda x: x.Year+x.quarter ))

Answer 21

dataframe["period"] = dataframe["Year"].astype(str).add(dataframe["quarter"])

或者如果值类似于[2000] [4]并想要设为[2000q4]

dataframe["period"] = dataframe["Year"].astype(str).add('q').add(dataframe["quarter"]).astype(str)

.astype(str).map(str)作品代替。

dataframe["period"] = dataframe["Year"].astype(str).add(dataframe["quarter"])

or if the values are like [2000] [4] and you want to make [2000q4]:

dataframe["period"] = dataframe["Year"].astype(str).add('q').add(dataframe["quarter"]).astype(str)

Substituting .astype(str) with .map(str) works too.


How to check if any value is NaN in a Pandas DataFrame

Question: How to check if any value is NaN in a Pandas DataFrame


In Python Pandas, what’s the best way to check whether a DataFrame has one (or more) NaN values?

I know about the function pd.isnull, but this returns a DataFrame of booleans for each element. This post right here doesn’t exactly answer my question either.


Answer 0


jwilner's response is spot on. I was exploring to see if there’s a faster option, since in my experience, summing flat arrays is (strangely) faster than counting. This code seems faster:

df.isnull().values.any()

For example:

In [2]: df = pd.DataFrame(np.random.randn(1000,1000))

In [3]: df[df > 0.9] = pd.np.nan

In [4]: %timeit df.isnull().any().any()
100 loops, best of 3: 14.7 ms per loop

In [5]: %timeit df.isnull().values.sum()
100 loops, best of 3: 2.15 ms per loop

In [6]: %timeit df.isnull().sum().sum()
100 loops, best of 3: 18 ms per loop

In [7]: %timeit df.isnull().values.any()
1000 loops, best of 3: 948 µs per loop

df.isnull().sum().sum() is a bit slower, but of course, has additional information — the number of NaNs.


Answer 1


You have a couple of options.

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(10,6))
# Make a few areas have NaN values
df.iloc[1:3,1] = np.nan
df.iloc[5,3] = np.nan
df.iloc[7:9,5] = np.nan

Now the data frame looks something like this:

          0         1         2         3         4         5
0  0.520113  0.884000  1.260966 -0.236597  0.312972 -0.196281
1 -0.837552       NaN  0.143017  0.862355  0.346550  0.842952
2 -0.452595       NaN -0.420790  0.456215  1.203459  0.527425
3  0.317503 -0.917042  1.780938 -1.584102  0.432745  0.389797
4 -0.722852  1.704820 -0.113821 -1.466458  0.083002  0.011722
5 -0.622851 -0.251935 -1.498837       NaN  1.098323  0.273814
6  0.329585  0.075312 -0.690209 -3.807924  0.489317 -0.841368
7 -1.123433 -1.187496  1.868894 -2.046456 -0.949718       NaN
8  1.133880 -0.110447  0.050385 -1.158387  0.188222       NaN
9 -0.513741  1.196259  0.704537  0.982395 -0.585040 -1.693810
  • Option 1: df.isnull().any().any() – This returns a boolean value

You know of the isnull() which would return a dataframe like this:

       0      1      2      3      4      5
0  False  False  False  False  False  False
1  False   True  False  False  False  False
2  False   True  False  False  False  False
3  False  False  False  False  False  False
4  False  False  False  False  False  False
5  False  False  False   True  False  False
6  False  False  False  False  False  False
7  False  False  False  False  False   True
8  False  False  False  False  False   True
9  False  False  False  False  False  False

If you make it df.isnull().any(), you can find just the columns that have NaN values:

0    False
1     True
2    False
3     True
4    False
5     True
dtype: bool

One more .any() will tell you if any of the above are True

> df.isnull().any().any()
True
  • Option 2: df.isnull().sum().sum() – This returns an integer of the total number of NaN values:

This operates the same way as the .any().any() does, by first giving a summation of the number of NaN values in a column, then the summation of those values:

df.isnull().sum()
0    0
1    2
2    0
3    1
4    0
5    2
dtype: int64

Finally, to get the total number of NaN values in the DataFrame:

df.isnull().sum().sum()
5

Answer 2

To find out which rows have NaNs in a specific column:

nan_rows = df[df['name column'].isnull()]

Answer 3

If you need to know how many rows there are with “one or more NaNs”:

df.isnull().T.any().T.sum()

Or if you need to pull out these rows and examine them:

nan_rows = df[df.isnull().T.any().T]
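
A hedged simplification (my addition, not part of the original answer): the double transpose is unnecessary here, since any() accepts an axis argument; the following should be equivalent:

df.isnull().any(axis=1).sum()           # number of rows with one or more NaNs
nan_rows = df[df.isnull().any(axis=1)]  # the rows themselves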

Answer 4

df.isnull().any().any() should do it.


Answer 5

Adding to Hobs’ brilliant answer: I am very new to Python and Pandas, so please point out if I am wrong.

To find out which rows have NaNs:

nan_rows = df[df.isnull().any(1)]

This performs the same operation without the need for transposing, by specifying the axis of any() as 1 to check whether ‘True’ is present in a row. (In recent pandas versions, spell this as df.isnull().any(axis=1), since the positional axis argument has been deprecated.)


Answer 6

Super Simple Syntax: df.isna().any(axis=None)

Starting from v0.23.2, you can use DataFrame.isna + DataFrame.any(axis=None) where axis=None specifies logical reduction over the entire DataFrame.

# Setup
df = pd.DataFrame({'A': [1, 2, np.nan], 'B' : [np.nan, 4, 5]})
df
     A    B
0  1.0  NaN
1  2.0  4.0
2  NaN  5.0

df.isna()

       A      B
0  False   True
1  False  False
2   True  False

df.isna().any(axis=None)
# True

Useful Alternatives

numpy.isnan
Another performant option if you’re running older versions of pandas.

np.isnan(df.values)

array([[False,  True],
       [False, False],
       [ True, False]])

np.isnan(df.values).any()
# True

Alternatively, check the sum:

np.isnan(df.values).sum()
# 2

np.isnan(df.values).sum() > 0
# True

Series.hasnans
You can also iteratively call Series.hasnans. For example, to check if a single column has NaNs,

df['A'].hasnans
# True

And to check if any column has NaNs, you can use a comprehension with any (which is a short-circuiting operation).

any(df[c].hasnans for c in df)
# True

This is actually very fast.
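
One caveat worth hedging (my addition): np.isnan only works on numeric dtypes; on an object-dtype frame it raises a TypeError, whereas pd.isna handles mixed types:

import numpy as np
import pandas as pd

df_obj = pd.DataFrame({'A': ['x', None]})
# np.isnan(df_obj.values)      # raises TypeError: ufunc 'isnan' not supported
pd.isna(df_obj.values).any()   # True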


Answer 7

Since no one has mentioned it: there is another attribute called hasnans.

df[i].hasnans will output True if one or more of the values in the pandas Series is NaN, False if not. Note that it’s not a function.

Tested with pandas versions ‘0.19.2’ and ‘0.20.2’.


Answer 8

Since pandas has to find this out for DataFrame.dropna(), I took a look to see how they implement it and discovered that they made use of DataFrame.count(), which counts all non-null values in the DataFrame. Cf. pandas source code. I haven’t benchmarked this technique, but I figure the authors of the library are likely to have made a wise choice for how to do it.


Answer 9

Let df be the name of the Pandas DataFrame, and any value that is numpy.nan a null value.

  1. If you want to see which columns have nulls and which do not (just True and False)
    df.isnull().any()

  2. If you want to see only the columns that have nulls
    df.loc[:, df.isnull().any()].columns

  3. If you want to see the count of nulls in every column
    df.isna().sum()

  4. If you want to see the percentage of nulls in every column
    df.isna().sum()/(len(df))*100

  5. If you want to see the percentage of nulls in columns that have nulls only:
    df.loc[:,list(df.loc[:,df.isnull().any()].columns)].isnull().sum()/(len(df))*100

EDIT 1:

If you want to see where your data is missing visually:

import missingno
missingdata_df = df.columns[df.isnull().any()].tolist()
missingno.matrix(df[missingdata_df])

Answer 10

Just use math.isnan(x): it returns True if x is a NaN (not a number), and False otherwise.
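
Note that math.isnan works on a single scalar only; a minimal sketch (the frame and labels are my own) of using it on one cell:

import math
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1.0, np.nan]})
math.isnan(df.at[1, 'A'])   # True; for whole columns/frames use pd.isna instead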


Answer 11

df.isnull().sum()

This will give you the count of all NaN values present in the respective columns of the DataFrame.


Answer 12

Here is another interesting way of finding null and replacing with a calculated value

    #Creating the DataFrame

    testdf2 = pd.DataFrame({'Tenure':[1,2,3,4,5],'Monthly':[10,20,30,40,50],'Yearly':[10,40,np.nan,np.nan,250]})
    >>> testdf2
       Monthly  Tenure  Yearly
    0       10       1    10.0
    1       20       2    40.0
    2       30       3     NaN
    3       40       4     NaN
    4       50       5   250.0

    #Identifying the rows with empty columns
    nan_rows = testdf2[testdf2['Yearly'].isnull()]
    >>> nan_rows
       Monthly  Tenure  Yearly
    2       30       3     NaN
    3       40       4     NaN

    #Getting the rows# into a list
    >>> index = list(nan_rows.index)
    >>> index
    [2, 3]

    # Replacing null values with calculated value
    >>> for i in index:
        testdf2['Yearly'][i] = testdf2['Monthly'][i] * testdf2['Tenure'][i]
    >>> testdf2
       Monthly  Tenure  Yearly
    0       10       1    10.0
    1       20       2    40.0
    2       30       3    90.0
    3       40       4   160.0
    4       50       5   250.0
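
A hedged aside (my addition): the same fill can be done without the explicit loop, and without the chained indexing above (which can trigger SettingWithCopyWarning), by vectorizing with fillna:

testdf2['Yearly'] = testdf2['Yearly'].fillna(testdf2['Monthly'] * testdf2['Tenure'])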

Answer 13

I’ve been using the following, type-casting the value to a string and checking for the nan value:

   (str(df.at[index, 'column']) == 'nan')

This allows me to check specific value in a series and not just return if this is contained somewhere within the series.
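
If only the scalar check is needed (rather than the string comparison), pd.isna also accepts a single value; a small sketch, assuming the same df, index and 'column' as above:

pd.isna(df.at[index, 'column'])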


Answer 14

Or you can use .info() on the DF, such as:

df.info(null_counts=True), which returns the number of non-null rows in each column, such as:

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3276314 entries, 0 to 3276313
Data columns (total 10 columns):
n_matches                          3276314 non-null int64
avg_pic_distance                   3276314 non-null float64

Answer 15

The best would be to use:

df.isna().any().any()

Here is why: isna() is used to define isnull(), but of course the two are identical.

This is even faster than the accepted answer and covers all 2D pandas arrays.


Answer 16

import missingno as msno
msno.matrix(df)  # just to visualize. no missing value.


Answer 17

df.apply(axis=0, func=lambda x : any(pd.isnull(x)))

This will check, for each column, whether it contains NaN or not.


Answer 18

We can see the null values present in the dataset by generating a heatmap using the seaborn module’s heatmap:

import pandas as pd
import seaborn as sns
dataset=pd.read_csv('train.csv')
sns.heatmap(dataset.isnull(),cbar=False)

Answer 19

You can not only check whether any ‘NaN’ exists, but also get the percentage of ‘NaN’s in each column, using the following:

df = pd.DataFrame({'col1':[1,2,3,4,5],'col2':[6,np.nan,8,9,10]})  
df  

   col1 col2  
0   1   6.0  
1   2   NaN  
2   3   8.0  
3   4   9.0  
4   5   10.0  


df.isnull().sum()/len(df)  
col1    0.0  
col2    0.2  
dtype: float64
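
As a side note (my addition, not part of the original answer), the same per-column fraction falls out of mean(), since the mean of a boolean mask is the fraction of True values:

df.isnull().mean()
# col1    0.0
# col2    0.2
# dtype: float64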

Answer 20

Depending on the type of data you’re dealing with, you could also just get the value counts of each column while performing your EDA by setting dropna to False.

for col in df:
   print(df[col].value_counts(dropna=False))

Works well for categorical variables, not so much when you have many unique values.


Set value for particular cell in pandas DataFrame using index

Question: Set value for particular cell in pandas DataFrame using index

I’ve created a Pandas DataFrame

df = DataFrame(index=['A','B','C'], columns=['x','y'])

and got this

    x    y
A  NaN  NaN
B  NaN  NaN
C  NaN  NaN


Then I want to assign value to particular cell, for example for row ‘C’ and column ‘x’. I’ve expected to get such result:

    x    y
A  NaN  NaN
B  NaN  NaN
C  10  NaN

with this code:

df.xs('C')['x'] = 10

but contents of df haven’t changed. It’s again only NaNs in DataFrame.

Any suggestions?


Answer 0

RukTech’s answer, df.set_value('C', 'x', 10), is far and away faster than the options I’ve suggested below. However, it has been slated for deprecation.

Going forward, the recommended method is .iat/.at.


Why df.xs('C')['x']=10 does not work:

df.xs('C') by default, returns a new dataframe with a copy of the data, so

df.xs('C')['x']=10

modifies this new dataframe only.

df['x'] returns a view of the df dataframe, so

df['x']['C'] = 10

modifies df itself.

Warning: It is sometimes difficult to predict if an operation returns a copy or a view. For this reason the docs recommend avoiding assignments with “chained indexing”.


So the recommended alternative is

df.at['C', 'x'] = 10

which does modify df.


In [18]: %timeit df.set_value('C', 'x', 10)
100000 loops, best of 3: 2.9 µs per loop

In [20]: %timeit df['x']['C'] = 10
100000 loops, best of 3: 6.31 µs per loop

In [81]: %timeit df.at['C', 'x'] = 10
100000 loops, best of 3: 9.2 µs per loop

Answer 1

Update: The .set_value method is going to be deprecated. .iat/.at are good replacements; unfortunately pandas provides little documentation for them.


The fastest way to do this is using set_value. This method is ~100 times faster than the .ix method. For example:

df.set_value('C', 'x', 10)


Answer 2

You can also use a conditional lookup using .loc as seen here:

df.loc[df[<some_column_name>] == <condition>, [<another_column_name>]] = <value_to_add>

where <some_column_name> is the column you want to check the <condition> variable against, and <another_column_name> is the column you want to add to (it can be a new column or one that already exists). <value_to_add> is the value you want to add to that column/row.

This example doesn’t work precisely with the question at hand, but it might be useful for someone who wants to add a specific value based on a condition.
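
A minimal runnable sketch of this pattern (the column names and values are invented for illustration):

import pandas as pd

df = pd.DataFrame({'age': [10, 20, 30]})
df.loc[df['age'] > 15, 'is_adult'] = True   # rows with age > 15 get True; the rest get NaN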


Answer 3

The recommended way (according to the maintainers) to set a value is:

df.ix['x','C']=10

Using ‘chained indexing’ (df['x']['C']) may lead to problems.

See:


Answer 4

Try using df.loc[row_index,col_indexer] = value


Answer 5

This is the only thing that worked for me!

df.loc['C', 'x'] = 10

Learn more about .loc here.


Answer 6

.iat/.at is a good solution. Suppose you have this simple data_frame:

   A   B   C
0  1   8   4 
1  3   9   6
2  22 33  52

If we want to modify the value of the cell [0,"A"], you can use one of these solutions:

  1. df.iat[0,0] = 2
  2. df.at[0,'A'] = 2

And here is a complete example of how to use iat to get and set the value of a cell:

def prepossessing(df):
  for index in range(0,len(df)): 
      df.iat[index,0] = df.iat[index,0] * 2
  return df

y_train before:

    0
0   54
1   15
2   15
3   8
4   31
5   63
6   11

y_train after calling the prepossessing function, which uses iat to multiply the value of each cell by 2:

     0
0   108
1   30
2   30
3   16
4   62
5   126
6   22
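
A hedged aside (my addition): the same doubling of the first column can be done without the Python loop, which is usually much faster on large frames:

df.iloc[:, 0] = df.iloc[:, 0] * 2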

Answer 7

To set values, use:

df.at[0, 'clm1'] = 0
  • The fastest recommended method for setting variables.
  • set_value and ix have been deprecated.
  • No warning, unlike iloc and loc

Answer 8

You can use .iloc:

df.iloc[[2], [0]] = 10
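
As a side note (my addition), the list form above assigns through a positional sub-frame; for a single cell the scalar form is equivalent and a little more direct:

df.iloc[2, 0] = 10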

Answer 9

In my example, I just change it in the selected cells:

    for index, row in result.iterrows():
        if np.isnan(row['weight']):
            result.at[index, 'weight'] = 0.0

‘result’ is a DataFrame with a ‘weight’ column.


Answer 10

set_value() is deprecated.

Starting from the release 0.23.4, pandas “announces the future”…

>>> df
                   Cars  Prices (U$)
0               Audi TT        120.0
1 Lamborghini Aventador        245.0
2      Chevrolet Malibu        190.0
>>> df.set_value(2, 'Prices (U$)', 240.0)
__main__:1: FutureWarning: set_value is deprecated and will be removed in a future release.
Please use .at[] or .iat[] accessors instead

                   Cars  Prices (U$)
0               Audi TT        120.0
1 Lamborghini Aventador        245.0
2      Chevrolet Malibu        240.0

Considering this advice, here’s a demonstration of how to use them:

  • by row/column integer positions

>>> df.iat[1, 1] = 260.0
>>> df
                   Cars  Prices (U$)
0               Audi TT        120.0
1 Lamborghini Aventador        260.0
2      Chevrolet Malibu        240.0
  • by row/column labels

>>> df.at[2, "Cars"] = "Chevrolet Corvette"
>>> df
                  Cars  Prices (U$)
0               Audi TT        120.0
1 Lamborghini Aventador        260.0
2    Chevrolet Corvette        240.0

References:


Answer 11

Here is a summary of the valid solutions provided by all users, for data frames indexed by integer and string.

df.iloc, df.loc and df.at work for both types of data frames; df.iloc only works with row/column integer indices, while df.loc and df.at support setting values using column names and/or integer indices.

When the specified index does not exist, both df.loc and df.at append the newly inserted rows/columns to the existing data frame, but df.iloc raises “IndexError: positional indexers are out-of-bounds”. A working example, tested in Python 2.7 and 3.7, follows:

import numpy as np, pandas as pd

df1 = pd.DataFrame(index=np.arange(3), columns=['x','y','z'])
df1['x'] = ['A','B','C']
df1.at[2,'y'] = 400

# rows/columns specified does not exist, appends new rows/columns to existing data frame
df1.at['D','w'] = 9000
df1.loc['E','q'] = 499

# using df[<some_column_name>] == <condition> to retrieve target rows
df1.at[df1['x']=='B', 'y'] = 10000
df1.loc[df1['x']=='B', ['z','w']] = 10000

# using a list of index to setup values
df1.iloc[[1,2,4], 2] = 9999
df1.loc[[0,'D','E'],'w'] = 7500
df1.at[[0,2,"D"],'x'] = 10
df1.at[:, ['y', 'w']] = 8000

df1
>>> df1
     x     y     z     w      q
0   10  8000   NaN  8000    NaN
1    B  8000  9999  8000    NaN
2   10  8000  9999  8000    NaN
D   10  8000   NaN  8000    NaN
E  NaN  8000  9999  8000  499.0

Answer 12

I tested, and df.set_value is a little faster, but the official method df.at looks like the fastest non-deprecated way to do it.

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(100, 100))

%timeit df.iat[50,50]=50 # ✓
%timeit df.at[50,50]=50 #  ✔
%timeit df.set_value(50,50,50) # will deprecate
%timeit df.iloc[50,50]=50
%timeit df.loc[50,50]=50

7.06 µs ± 118 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
5.52 µs ± 64.2 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
3.68 µs ± 80.8 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
98.7 µs ± 1.07 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
109 µs ± 1.42 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Note this is setting the value for a single cell. For vectors, loc and iloc should be better options, since they are vectorized.


Answer 13

One way to use an index with a condition is to first get the index of all the rows that satisfy your condition, and then use those row indexes in a number of ways:

conditional_index = df.loc[ df['col name'] <condition> ].index

Example conditions look like

==5, >10 , =="Any string", >= DateTime

Then you can use these row indexes in a variety of ways, e.g.

  1. Replace the value of one column for conditional_index
df.loc[conditional_index, [col name]] = <new value>
  2. Replace the value of multiple columns for conditional_index
df.loc[conditional_index, [col1,col2]] = <new value>
  3. One benefit of saving conditional_index is that you can assign the value of one column to another column with the same row index
df.loc[conditional_index, [col1,col2]] = df.loc[conditional_index, 'col name']

This is all possible because .index returns an array of indexes, which .loc can use with direct addressing, so it avoids repeated traversals.
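
A minimal runnable sketch of this pattern (names invented for illustration):

import pandas as pd

df = pd.DataFrame({'col': [3, 7, 12], 'flag': [0, 0, 0]})
conditional_index = df.loc[df['col'] > 5].index   # rows 1 and 2
df.loc[conditional_index, 'flag'] = 1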


Answer 14

df.loc['C','x']=10 will change the value in row ‘C’ and column ‘x’.


Answer 15

In addition to the answers above, here is a benchmark comparing different ways to add rows of data to an existing dataframe. It shows that using at or set_value is the most efficient way for large dataframes (at least under these test conditions).

  • Create new dataframe for each row and…
    • … append it (13.0 s)
    • … concatenate it (13.1 s)
  • Store all new rows in another container first, convert to new dataframe once and append…
    • container = lists of lists (2.0 s)
    • container = dictionary of lists (1.9 s)
  • Preallocate whole dataframe, iterate over new rows and all columns and fill using
    • … at (0.6 s)
    • … set_value (0.4 s)

For the test, an existing dataframe comprising 100,000 rows and 1,000 columns and random numpy values was used. To this dataframe, 100 new rows were added.

See the code below:

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Created on Wed Nov 21 16:38:46 2018

@author: gebbissimo
"""

import pandas as pd
import numpy as np
import time

NUM_ROWS = 100000
NUM_COLS = 1000
data = np.random.rand(NUM_ROWS,NUM_COLS)
df = pd.DataFrame(data)

NUM_ROWS_NEW = 100
data_tot = np.random.rand(NUM_ROWS + NUM_ROWS_NEW,NUM_COLS)
df_tot = pd.DataFrame(data_tot)

DATA_NEW = np.random.rand(1,NUM_COLS)


#%% FUNCTIONS

# create and append
def create_and_append(df):
    for i in range(NUM_ROWS_NEW):
        df_new = pd.DataFrame(DATA_NEW)
        df = df.append(df_new)
    return df

# create and concatenate
def create_and_concat(df):
    for i in range(NUM_ROWS_NEW):
        df_new = pd.DataFrame(DATA_NEW)
        df = pd.concat((df, df_new))
    return df


# store as dict and 
def store_as_list(df):
    lst = [[] for i in range(NUM_ROWS_NEW)]
    for i in range(NUM_ROWS_NEW):
        for j in range(NUM_COLS):
            lst[i].append(DATA_NEW[0,j])
    df_new = pd.DataFrame(lst)
    df_tot = df.append(df_new)
    return df_tot

# store as dict and 
def store_as_dict(df):
    dct = {}
    for j in range(NUM_COLS):
        dct[j] = []
        for i in range(NUM_ROWS_NEW):
            dct[j].append(DATA_NEW[0,j])
    df_new = pd.DataFrame(dct)
    df_tot = df.append(df_new)
    return df_tot




# preallocate and fill using .at
def fill_using_at(df):
    for i in range(NUM_ROWS_NEW):
        for j in range(NUM_COLS):
            #print("i,j={},{}".format(i,j))
            df.at[NUM_ROWS+i,j] = DATA_NEW[0,j]
    return df


# preallocate and fill using .at
def fill_using_set(df):
    for i in range(NUM_ROWS_NEW):
        for j in range(NUM_COLS):
            #print("i,j={},{}".format(i,j))
            df.set_value(NUM_ROWS+i,j,DATA_NEW[0,j])
    return df


#%% TESTS
t0 = time.time()    
create_and_append(df)
t1 = time.time()
print('Needed {} seconds'.format(t1-t0))

t0 = time.time()    
create_and_concat(df)
t1 = time.time()
print('Needed {} seconds'.format(t1-t0))

t0 = time.time()    
store_as_list(df)
t1 = time.time()
print('Needed {} seconds'.format(t1-t0))

t0 = time.time()    
store_as_dict(df)
t1 = time.time()
print('Needed {} seconds'.format(t1-t0))

t0 = time.time()    
fill_using_at(df_tot)
t1 = time.time()
print('Needed {} seconds'.format(t1-t0))

t0 = time.time()    
fill_using_set(df_tot)
t1 = time.time()
print('Needed {} seconds'.format(t1-t0))
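
A hedged maintenance note (my addition): DataFrame.append, used throughout this benchmark, was deprecated in pandas 1.4 and removed in pandas 2.0; the replacement is pd.concat:

# instead of: df = df.append(df_new)
df = pd.concat([df, df_new])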

Answer 16

If you want to change values not for whole row, but only for some columns:

x = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
x.iloc[1] = dict(A=10, B=-10)

Answer 17

From version 0.21.1 you can also use the .at method. There are some differences compared to .loc, as mentioned here (pandas .at versus .loc), but it’s faster for single-value replacement.


Answer 18

So, your question is to convert the NaN at [‘C’, ‘x’] to the value 10.

The answer is:

df['x'].loc['C':]=10
df

An alternative is:

df.loc['C':'x']=10
df

Answer 19

I too was searching for this topic, and I put together a way to iterate through a DataFrame and update it with lookup values from a second DataFrame. Here is my code.

src_df = pd.read_sql_query(src_sql,src_connection)
for index1, row1 in src_df.iterrows():
    for index, row in vertical_df.iterrows():
        src_df.set_value(index=index1,col=u'etl_load_key',value=etl_load_key)
        if row1[u'src_id'] == row['SRC_ID']:
            src_df.set_value(index=index1,col=u'vertical',value=row['VERTICAL'])

Difference between map, applymap and apply methods in Pandas

Question: Difference between map, applymap and apply methods in Pandas

Can you tell me when to use these vectorization methods with basic examples?

I see that map is a Series method whereas the rest are DataFrame methods. I got confused about apply and applymap methods though. Why do we have two methods for applying a function to a DataFrame? Again, simple examples which illustrate the usage would be great!


Answer 0

Straight from Wes McKinney’s Python for Data Analysis book, pg. 132 (I highly recommend this book):

Another frequent operation is applying a function on 1D arrays to each column or row. DataFrame’s apply method does exactly this:

In [116]: frame = DataFrame(np.random.randn(4, 3), columns=list('bde'), index=['Utah', 'Ohio', 'Texas', 'Oregon'])

In [117]: frame
Out[117]: 
               b         d         e
Utah   -0.029638  1.081563  1.280300
Ohio    0.647747  0.831136 -1.549481
Texas   0.513416 -0.884417  0.195343
Oregon -0.485454 -0.477388 -0.309548

In [118]: f = lambda x: x.max() - x.min()

In [119]: frame.apply(f)
Out[119]: 
b    1.133201
d    1.965980
e    2.829781
dtype: float64

Many of the most common array statistics (like sum and mean) are DataFrame methods, so using apply is not necessary.

Element-wise Python functions can be used, too. Suppose you wanted to compute a formatted string from each floating point value in frame. You can do this with applymap:

In [120]: format = lambda x: '%.2f' % x

In [121]: frame.applymap(format)
Out[121]: 
            b      d      e
Utah    -0.03   1.08   1.28
Ohio     0.65   0.83  -1.55
Texas    0.51  -0.88   0.20
Oregon  -0.49  -0.48  -0.31

The reason for the name applymap is that Series has a map method for applying an element-wise function:

In [122]: frame['e'].map(format)
Out[122]: 
Utah       1.28
Ohio      -1.55
Texas      0.20
Oregon    -0.31
Name: e, dtype: object

Summing up, apply works on a row / column basis of a DataFrame, applymap works element-wise on a DataFrame, and map works element-wise on a Series.
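
A hedged version note (my addition): in pandas 2.1, DataFrame.applymap was renamed to DataFrame.map, so on recent versions the element-wise call above can be written as:

frame.map(format)   # pandas >= 2.1; applymap still works but is deprecated in favour of this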


Answer 1

Comparing map, applymap and apply: Context Matters

First major difference: DEFINITION

  • map is defined on Series ONLY
  • applymap is defined on DataFrames ONLY
  • apply is defined on BOTH

Second major difference: INPUT ARGUMENT

  • map accepts dicts, Series, or callable
  • applymap and apply accept callables only

Third major difference: BEHAVIOR

  • map is elementwise for Series
  • applymap is elementwise for DataFrames
  • apply also works elementwise but is suited to more complex operations and aggregation. The behaviour and return value depends on the function.

Fourth major difference (the most important one): USE CASE

  • map is meant for mapping values from one domain to another, so is optimised for performance (e.g., df['A'].map({1:'a', 2:'b', 3:'c'}))
  • applymap is good for elementwise transformations across multiple rows/columns (e.g., df[['A', 'B', 'C']].applymap(str.strip))
  • apply is for applying any function that cannot be vectorised (e.g., df['sentences'].apply(nltk.sent_tokenize))

Summarising

[The original answer shows a summary table as an image; reconstructed from the points above:]

               map                       applymap         apply
Defined on     Series only               DataFrames only  Series and DataFrames
Accepts        dicts, Series, callables  callables        callables
Behaviour      elementwise               elementwise      elementwise, row/column-wise, aggregation

Footnotes

  1. map when passed a dictionary/Series will map elements based on the keys in that dictionary/Series. Missing values will be recorded as NaN in the output.
  2. applymap in more recent versions has been optimised for some operations. You will find applymap slightly faster than apply in some cases. My suggestion is to test them both and use whatever works better.

  3. map is optimised for elementwise mappings and transformation. Operations that involve dictionaries or Series will enable pandas to use faster code paths for better performance.

  4. Series.apply returns a scalar for aggregating operations, Series otherwise. Similarly for DataFrame.apply. Note that apply also has fastpaths when called with certain NumPy functions such as mean, sum, etc.
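
A small sketch of footnote 1 (my example): keys missing from the mapping come back as NaN:

import pandas as pd

s = pd.Series([1, 2, 3])
s.map({1: 'a', 2: 'b'})
# 0      a
# 1      b
# 2    NaN
# dtype: object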

Answer 2

There’s great information in these answers, but I’m adding my own to clearly summarize which methods work array-wise versus element-wise. jeremiahbuddha mostly did this but did not mention Series.apply. I don’t have the rep to comment.

  • DataFrame.apply operates on entire rows or columns at a time.

  • DataFrame.applymap, Series.apply, and Series.map operate on one element at time.

There is a lot of overlap between the capabilities of Series.apply and Series.map, meaning that either one will work in most cases. They do have some slight differences though, some of which were discussed in osa’s answer.


Answer 3

Adding to the other answers, in a Series there are also map and apply.

Apply can make a DataFrame out of a series; however, map will just put a series in every cell of another series, which is probably not what you want.

In [40]: p=pd.Series([1,2,3])
In [41]: p
Out[31]:
0    1
1    2
2    3
dtype: int64

In [42]: p.apply(lambda x: pd.Series([x, x]))
Out[42]: 
   0  1
0  1  1
1  2  2
2  3  3

In [43]: p.map(lambda x: pd.Series([x, x]))
Out[43]: 
0    0    1
1    1
dtype: int64
1    0    2
1    2
dtype: int64
2    0    3
1    3
dtype: int64
dtype: object

Also if I had a function with side effects, such as “connect to a web server”, I’d probably use apply just for the sake of clarity.

series.apply(download_file_for_every_element) 

Map can use not only a function, but also a dictionary or another series. Let’s say you want to manipulate permutations.

Take

1 2 3 4 5
2 1 4 5 3

The square of this permutation is

1 2 3 4 5
1 2 5 3 4

You can compute it using map. Not sure if self-application is documented, but it works in 0.15.1.

In [39]: p=pd.Series([1,0,3,4,2])

In [40]: p.map(p)
Out[40]: 
0    0
1    1
2    4
3    2
4    3
dtype: int64

Answer 4

@jeremiahbuddha mentioned that apply works on rows/columns, while applymap works element-wise. But it seems you can still use apply for element-wise computation:

    frame.apply(np.sqrt)
    Out[102]: 
                   b         d         e
    Utah         NaN  1.435159       NaN
    Ohio    1.098164  0.510594  0.729748
    Texas        NaN  0.456436  0.697337
    Oregon  0.359079       NaN       NaN

    frame.applymap(np.sqrt)
    Out[103]: 
                   b         d         e
    Utah         NaN  1.435159       NaN
    Ohio    1.098164  0.510594  0.729748
    Texas        NaN  0.456436  0.697337
    Oregon  0.359079       NaN       NaN

Answer 5

Just wanted to point out, as I struggled with this for a bit:

def f(x):
    if x < 0:
        x = 0
    elif x > 100000:
        x = 100000
    return x

df.applymap(f)
df.describe()

this does not modify the dataframe in place; the result has to be reassigned:

df = df.applymap(f)
df.describe()

Answer 6

Probably the simplest explanation of the difference between apply and applymap:

apply takes a whole column as a parameter and then assigns the result to this column.

applymap takes a single cell value as a parameter and assigns the result back to that cell.

NB: if apply returns a single value, you will have this value instead of the column after assigning, and will eventually have just a row instead of a matrix.
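
A tiny sketch of the contrast (my example):

import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
df.apply(lambda col: col.sum())   # col is a whole Series; one value per column
df.applymap(lambda v: v * 10)     # v is a single cell value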


Answer 7

My understanding:

From the function point of view:

If the function needs to compare values within a column/row, use apply.

e.g.: lambda x: x.max()-x.mean().

If the function is to be applied to each element:

1> If it targets a particular column/row, use apply

2> If it applies to the entire dataframe, use applymap

majority = lambda x : x > 17
df2['legal_drinker'] = df2['age'].apply(majority)

def times10(x):
  if type(x) is int:
    x *= 10 
  return x
df2.applymap(times10)

Answer 8

Based on the answer of cs95

  • map is defined on Series ONLY
  • applymap is defined on DataFrames ONLY
  • apply is defined on BOTH

Here are some examples:

In [3]: frame = pd.DataFrame(np.random.randn(4, 3), columns=list('bde'), index=['Utah', 'Ohio', 'Texas', 'Oregon'])

In [4]: frame
Out[4]:
            b         d         e
Utah    0.129885 -0.475957 -0.207679
Ohio   -2.978331 -1.015918  0.784675
Texas  -0.256689 -0.226366  2.262588
Oregon  2.605526  1.139105 -0.927518

In [5]: myformat=lambda x: f'{x:.2f}'

In [6]: frame.d.map(myformat)
Out[6]:
Utah      -0.48
Ohio      -1.02
Texas     -0.23
Oregon     1.14
Name: d, dtype: object

In [7]: frame.d.apply(myformat)
Out[7]:
Utah      -0.48
Ohio      -1.02
Texas     -0.23
Oregon     1.14
Name: d, dtype: object

In [8]: frame.applymap(myformat)
Out[8]:
            b      d      e
Utah     0.13  -0.48  -0.21
Ohio    -2.98  -1.02   0.78
Texas   -0.26  -0.23   2.26
Oregon   2.61   1.14  -0.93

In [9]: frame.apply(lambda x: x.apply(myformat))
Out[9]:
            b      d      e
Utah     0.13  -0.48  -0.21
Ohio    -2.98  -1.02   0.78
Texas   -0.26  -0.23   2.26
Oregon   2.61   1.14  -0.93


In [10]: myfunc=lambda x: x**2

In [11]: frame.applymap(myfunc)
Out[11]:
            b         d         e
Utah    0.016870  0.226535  0.043131
Ohio    8.870453  1.032089  0.615714
Texas   0.065889  0.051242  5.119305
Oregon  6.788766  1.297560  0.860289

In [12]: frame.apply(myfunc)
Out[12]:
            b         d         e
Utah    0.016870  0.226535  0.043131
Ohio    8.870453  1.032089  0.615714
Texas   0.065889  0.051242  5.119305
Oregon  6.788766  1.297560  0.860289

Answer 9

FOMO:

The following example shows apply and applymap applied to a DataFrame.

The map function is something you apply on a Series only. You cannot use map on a DataFrame.

The thing to remember is that apply can do anything applymap can, but apply has extra options.

The extra options are: axis and result_type, where result_type only works when axis=1 (for columns).

import numpy as np
import pandas as pd

df = pd.DataFrame(1, columns=list('abc'),
                  index=list('1234'))
print(df)

f = lambda x: np.log(x)
print(df.applymap(f)) # apply to the whole dataframe
print(np.log(df)) # applied to the whole dataframe
print(df.applymap(np.sum)) # reducing can be applied for rows only

# apply can take different options (vs. applymap cannot)
print(df.apply(f)) # same as applymap
print(df.apply(sum, axis=1))  # reducing example
print(df.apply(np.log, axis=1)) # cannot reduce
print(df.apply(lambda x: [1, 2, 3], axis=1, result_type='expand')) # expand result

As a side note, the Series map function should not be confused with the Python built-in map function.

The first one is applied on a Series, to map its values, and the second one to every item of an iterable.


Lastly, don’t confuse the dataframe apply method with the groupby apply method.


Convert pandas dataframe to NumPy array

Question: Convert pandas dataframe to NumPy array

I am interested in knowing how to convert a pandas dataframe into a NumPy array.

dataframe:

import numpy as np
import pandas as pd

index = [1, 2, 3, 4, 5, 6, 7]
a = [np.nan, np.nan, np.nan, 0.1, 0.1, 0.1, 0.1]
b = [0.2, np.nan, 0.2, 0.2, 0.2, np.nan, np.nan]
c = [np.nan, 0.5, 0.5, np.nan, 0.5, 0.5, np.nan]
df = pd.DataFrame({'A': a, 'B': b, 'C': c}, index=index)
df = df.rename_axis('ID')

gives

label   A    B    C
ID                                 
1   NaN  0.2  NaN
2   NaN  NaN  0.5
3   NaN  0.2  0.5
4   0.1  0.2  NaN
5   0.1  0.2  0.5
6   0.1  NaN  0.5
7   0.1  NaN  NaN

I would like to convert this to a NumPy array, as so:

array([[ nan,  0.2,  nan],
       [ nan,  nan,  0.5],
       [ nan,  0.2,  0.5],
       [ 0.1,  0.2,  nan],
       [ 0.1,  0.2,  0.5],
       [ 0.1,  nan,  0.5],
       [ 0.1,  nan,  nan]])

How can I do this?


As a bonus, is it possible to preserve the dtypes, like this?

array([[ 1, nan,  0.2,  nan],
       [ 2, nan,  nan,  0.5],
       [ 3, nan,  0.2,  0.5],
       [ 4, 0.1,  0.2,  nan],
       [ 5, 0.1,  0.2,  0.5],
       [ 6, 0.1,  nan,  0.5],
       [ 7, 0.1,  nan,  nan]],
     dtype=[('ID', '<i4'), ('A', '<f8'), ('B', '<f8'), ('B', '<f8')])

or similar?


Answer 0

To convert a pandas dataframe (df) to a numpy ndarray, use this code:

df.values

array([[nan, 0.2, nan],
       [nan, nan, 0.5],
       [nan, 0.2, 0.5],
       [0.1, 0.2, nan],
       [0.1, 0.2, 0.5],
       [0.1, nan, 0.5],
       [0.1, nan, nan]])

Answer 1

Deprecate your usage of values and as_matrix()!

pandas v0.24.0 introduced two new methods for obtaining NumPy arrays from pandas objects:

  1. to_numpy(), which is defined on Index, Series, and DataFrame objects, and
  2. array, which is defined on Index and Series objects only.

If you visit the v0.24 docs for .values, you will see a big red warning that says:

Warning: We recommend using DataFrame.to_numpy() instead.

See this section of the v0.24.0 release notes, and this answer for more information.


Towards Better Consistency: to_numpy()

In the spirit of better consistency throughout the API, a new method to_numpy has been introduced to extract the underlying NumPy array from DataFrames.

# Setup.
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]}, index=['a', 'b', 'c'])

df.to_numpy()
array([[1, 4],
       [2, 5],
       [3, 6]])

As mentioned above, this method is also defined on Index and Series objects (see here).

df.index.to_numpy()
# array(['a', 'b', 'c'], dtype=object)

df['A'].to_numpy()
#  array([1, 2, 3])

By default, a view is returned, so any modifications made will affect the original.

v = df.to_numpy()
v[0, 0] = -1

df
   A  B
a -1  4
b  2  5
c  3  6

If you need a copy instead, use to_numpy(copy=True).
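
For example, a quick check of the copy semantics (a minimal sketch, reusing the setup frame from above):

import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]}, index=['a', 'b', 'c'])
v = df.to_numpy(copy=True)
v[0, 0] = -1          # mutate the copy only
df.loc['a', 'A']      # still 1: df is untouched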

pandas >= 1.0 update for ExtensionTypes

If you’re using pandas 1.x, chances are you’ll be dealing with extension types a lot more. You’ll have to be a little more careful that these extension types are correctly converted.

a = pd.array([1, 2, None], dtype="Int64")                                  
a                                                                          

<IntegerArray>
[1, 2, <NA>]
Length: 3, dtype: Int64 

# Wrong
a.to_numpy()                                                               
# array([1, 2, <NA>], dtype=object)  # yuck, objects

# Right
a.to_numpy(dtype='float', na_value=np.nan)                                 
# array([ 1.,  2., nan])

This is called out in the docs.

If you need the dtypes

As shown in another answer, DataFrame.to_records is a good way to do this.

df.to_records()
# rec.array([('a', -1, 4), ('b',  2, 5), ('c',  3, 6)],
#           dtype=[('index', 'O'), ('A', '<i8'), ('B', '<i8')])

This cannot be done with to_numpy, unfortunately. However, as an alternative, you can use np.rec.fromrecords:

v = df.reset_index()
np.rec.fromrecords(v, names=v.columns.tolist())
# rec.array([('a', -1, 4), ('b',  2, 5), ('c',  3, 6)],
#          dtype=[('index', '<U1'), ('A', '<i8'), ('B', '<i8')])

Performance-wise, it's nearly the same (actually, using rec.fromrecords is a bit faster).

df2 = pd.concat([df] * 10000)

%timeit df2.to_records()
%%timeit
v = df2.reset_index()
np.rec.fromrecords(v, names=v.columns.tolist())

11.1 ms ± 557 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
9.67 ms ± 126 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Rationale for Adding a New Method

to_numpy() (in addition to array) was added as a result of discussions under two GitHub issues GH19954 and GH23623.

Specifically, the docs mention the rationale:

[…] with .values it was unclear whether the returned value would be the actual array, some transformation of it, or one of pandas custom arrays (like Categorical). For example, with PeriodIndex, .values generates a new ndarray of period objects each time. […]

to_numpy aims to improve the consistency of the API, which is a major step in the right direction. .values will not be deprecated in the current version, but I expect this may happen at some point in the future, so I would urge users to migrate towards the newer API as soon as they can.


Critique of Other Solutions

DataFrame.values has inconsistent behaviour, as already noted.

DataFrame.get_values() is simply a wrapper around DataFrame.values, so everything said above applies.

DataFrame.as_matrix() is deprecated now, do NOT use!
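
To illustrate that inconsistency, here is a small sketch (df_mixed is a made-up example, not from the answers above): on a frame with mixed dtypes, .values silently upcasts everything to object, while extracting a single column with to_numpy keeps its native dtype.

import pandas as pd

df_mixed = pd.DataFrame({'A': [1, 2], 'B': ['x', 'y']})
df_mixed.values.dtype       # dtype('O'): the ints were upcast to object
df_mixed['A'].to_numpy()    # array([1, 2]): the int64 column survives intact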


Answer 2

Note: The .as_matrix() method used in this answer is deprecated. Pandas 0.23.4 warns:

Method .as_matrix will be removed in a future version. Use .values instead.


Pandas has something built in…

numpy_matrix = df.as_matrix()

gives

array([[nan, 0.2, nan],
       [nan, nan, 0.5],
       [nan, 0.2, 0.5],
       [0.1, 0.2, nan],
       [0.1, 0.2, 0.5],
       [0.1, nan, 0.5],
       [0.1, nan, nan]])

Answer 3

I would just chain the DataFrame.reset_index() and DataFrame.values functions to get the Numpy representation of the dataframe, including the index:

In [8]: df
Out[8]: 
          A         B         C
0 -0.982726  0.150726  0.691625
1  0.617297 -0.471879  0.505547
2  0.417123 -1.356803 -1.013499
3 -0.166363 -0.957758  1.178659
4 -0.164103  0.074516 -0.674325
5 -0.340169 -0.293698  1.231791
6 -1.062825  0.556273  1.508058
7  0.959610  0.247539  0.091333

[8 rows x 3 columns]

In [9]: df.reset_index().values
Out[9]:
array([[ 0.        , -0.98272574,  0.150726  ,  0.69162512],
       [ 1.        ,  0.61729734, -0.47187926,  0.50554728],
       [ 2.        ,  0.4171228 , -1.35680324, -1.01349922],
       [ 3.        , -0.16636303, -0.95775849,  1.17865945],
       [ 4.        , -0.16410334,  0.0745164 , -0.67432474],
       [ 5.        , -0.34016865, -0.29369841,  1.23179064],
       [ 6.        , -1.06282542,  0.55627285,  1.50805754],
       [ 7.        ,  0.95961001,  0.24753911,  0.09133339]])

To get the dtypes we’d need to transform this ndarray into a structured array using view:

In [10]: df.reset_index().values.ravel().view(dtype=[('index', int), ('A', float), ('B', float), ('C', float)])
Out[10]:
array([( 0, -0.98272574,  0.150726  ,  0.69162512),
       ( 1,  0.61729734, -0.47187926,  0.50554728),
       ( 2,  0.4171228 , -1.35680324, -1.01349922),
       ( 3, -0.16636303, -0.95775849,  1.17865945),
       ( 4, -0.16410334,  0.0745164 , -0.67432474),
       ( 5, -0.34016865, -0.29369841,  1.23179064),
       ( 6, -1.06282542,  0.55627285,  1.50805754),
       ( 7,  0.95961001,  0.24753911,  0.09133339)],
      dtype=[('index', '<i8'), ('A', '<f8'), ('B', '<f8'), ('C', '<f8')])

Answer 4

You can use the to_records method, but have to play around a bit with the dtypes if they are not what you want from the get go. In my case, having copied your DF from a string, the index type is string (represented by an object dtype in pandas):

In [102]: df
Out[102]: 
label    A    B    C
ID                  
1      NaN  0.2  NaN
2      NaN  NaN  0.5
3      NaN  0.2  0.5
4      0.1  0.2  NaN
5      0.1  0.2  0.5
6      0.1  NaN  0.5
7      0.1  NaN  NaN

In [103]: df.index.dtype
Out[103]: dtype('object')
In [104]: df.to_records()
Out[104]: 
rec.array([(1, nan, 0.2, nan), (2, nan, nan, 0.5), (3, nan, 0.2, 0.5),
       (4, 0.1, 0.2, nan), (5, 0.1, 0.2, 0.5), (6, 0.1, nan, 0.5),
       (7, 0.1, nan, nan)], 
      dtype=[('index', '|O8'), ('A', '<f8'), ('B', '<f8'), ('C', '<f8')])
In [106]: df.to_records().dtype
Out[106]: dtype([('index', '|O8'), ('A', '<f8'), ('B', '<f8'), ('C', '<f8')])

Converting the recarray dtype does not work for me, but one can do this in Pandas already:

In [109]: df.index = df.index.astype('i8')
In [111]: df.to_records().view([('ID', '<i8'), ('A', '<f8'), ('B', '<f8'), ('C', '<f8')])
Out[111]:
rec.array([(1, nan, 0.2, nan), (2, nan, nan, 0.5), (3, nan, 0.2, 0.5),
       (4, 0.1, 0.2, nan), (5, 0.1, 0.2, 0.5), (6, 0.1, nan, 0.5),
       (7, 0.1, nan, nan)], 
      dtype=[('ID', '<i8'), ('A', '<f8'), ('B', '<f8'), ('C', '<f8')])

Note that Pandas does not set the name of the index properly (to ID) in the exported record array (a bug?), so we profit from the type conversion to also correct for that.

At the moment Pandas has only 8-byte integers, i8, and floats, f8 (see this issue).


Answer 5

It seems like df.to_records() will work for you. The exact feature you’re looking for was requested and to_records pointed to as an alternative.

I tried this out locally using your example, and that call yields something very similar to the output you were looking for:

rec.array([(1, nan, 0.2, nan), (2, nan, nan, 0.5), (3, nan, 0.2, 0.5),
       (4, 0.1, 0.2, nan), (5, 0.1, 0.2, 0.5), (6, 0.1, nan, 0.5),
       (7, 0.1, nan, nan)],
      dtype=[(u'ID', '<i8'), (u'A', '<f8'), (u'B', '<f8'), (u'C', '<f8')])

Note that this is a recarray rather than an array. You could move the result into a regular numpy array by calling its constructor as np.array(df.to_records()).


Answer 6

Try this:

a = numpy.asarray(df)

Answer 7

Here is my approach to making a structure array from a pandas DataFrame.

Create the data frame

import pandas as pd
import numpy as np
import six

NaN = float('nan')
ID = [1, 2, 3, 4, 5, 6, 7]
A = [NaN, NaN, NaN, 0.1, 0.1, 0.1, 0.1]
B = [0.2, NaN, 0.2, 0.2, 0.2, NaN, NaN]
C = [NaN, 0.5, 0.5, NaN, 0.5, 0.5, NaN]
columns = {'A':A, 'B':B, 'C':C}
df = pd.DataFrame(columns, index=ID)
df.index.name = 'ID'
print(df)

      A    B    C
ID               
1   NaN  0.2  NaN
2   NaN  NaN  0.5
3   NaN  0.2  0.5
4   0.1  0.2  NaN
5   0.1  0.2  0.5
6   0.1  NaN  0.5
7   0.1  NaN  NaN

Define function to make a numpy structure array (not a record array) from a pandas DataFrame.

def df_to_sarray(df):
    """
    Convert a pandas DataFrame object to a numpy structured array.
    This is functionally equivalent to but more efficient than
    np.array(df.to_array())

    :param df: the data frame to convert
    :return: a numpy structured array representation of df
    """

    v = df.values
    cols = df.columns

    if six.PY2:  # python 2 needs .encode() but 3 does not
        types = [(cols[i].encode(), df[k].dtype.type) for (i, k) in enumerate(cols)]
    else:
        types = [(cols[i], df[k].dtype.type) for (i, k) in enumerate(cols)]
    dtype = np.dtype(types)
    z = np.zeros(v.shape[0], dtype)
    for (i, k) in enumerate(z.dtype.names):
        z[k] = v[:, i]
    return z

Use reset_index to make a new data frame that includes the index as part of its data. Convert that data frame to a structure array.

sa = df_to_sarray(df.reset_index())
sa

array([(1L, nan, 0.2, nan), (2L, nan, nan, 0.5), (3L, nan, 0.2, 0.5),
       (4L, 0.1, 0.2, nan), (5L, 0.1, 0.2, 0.5), (6L, 0.1, nan, 0.5),
       (7L, 0.1, nan, nan)], 
      dtype=[('ID', '<i8'), ('A', '<f8'), ('B', '<f8'), ('C', '<f8')])

EDIT: Updated df_to_sarray to avoid error calling .encode() with python 3. Thanks to Joseph Garvin and halcyon for their comment and solution.


Answer 8

Two ways to convert the data-frame to its Numpy-array representation.

  • mah_np_array = df.as_matrix(columns=None)

  • mah_np_array = df.values

Doc: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.as_matrix.html


Answer 9

A Simpler Way for Example DataFrame:

df

         gbm       nnet        reg
0  12.097439  12.047437  12.100953
1  12.109811  12.070209  12.095288
2  11.720734  11.622139  11.740523
3  11.824557  11.926414  11.926527
4  11.800868  11.727730  11.729737
5  12.490984  12.502440  12.530894

USE:

np.array(df.to_records().view(type=np.matrix))

GET:

array([[(0, 12.097439  , 12.047437, 12.10095324),
        (1, 12.10981081, 12.070209, 12.09528824),
        (2, 11.72073428, 11.622139, 11.74052253),
        (3, 11.82455653, 11.926414, 11.92652727),
        (4, 11.80086775, 11.72773 , 11.72973699),
        (5, 12.49098389, 12.50244 , 12.53089367)]],
dtype=(numpy.record, [('index', '<i8'), ('gbm', '<f8'), ('nnet', '<f4'),
       ('reg', '<f8')]))

Answer 10

Just had a similar problem when exporting from dataframe to arcgis table and stumbled on a solution from usgs (https://my.usgs.gov/confluence/display/cdi/pandas.DataFrame+to+ArcGIS+Table). In short your problem has a similar solution:

df

      A    B    C
ID               
1   NaN  0.2  NaN
2   NaN  NaN  0.5
3   NaN  0.2  0.5
4   0.1  0.2  NaN
5   0.1  0.2  0.5
6   0.1  NaN  0.5
7   0.1  NaN  NaN

np_data = np.array(np.rec.fromrecords(df.values))
np_names = df.dtypes.index.tolist()
np_data.dtype.names = tuple([name.encode('UTF8') for name in np_names])

np_data

array([( nan,  0.2,  nan), ( nan,  nan,  0.5), ( nan,  0.2,  0.5),
       ( 0.1,  0.2,  nan), ( 0.1,  0.2,  0.5), ( 0.1,  nan,  0.5),
       ( 0.1,  nan,  nan)], 
      dtype=(numpy.record, [('A', '<f8'), ('B', '<f8'), ('C', '<f8')]))

Answer 11

I went through the answers above. The “as_matrix()” method works but its obsolete now. For me, What worked was “.to_numpy()“.

This returns a multidimensional array. I’ll prefer using this method if you’re reading data from excel sheet and you need to access data from any index. Hope this helps :)


Answer 12

Further to meteore’s answer, I found the code

df.index = df.index.astype('i8')

doesn’t work for me. So I put my code here for the convenience of others stuck with this issue.

city_cluster_df = pd.read_csv(text_filepath, encoding='utf-8')
# the field 'city_en' is a string, when converted to Numpy array, it will be an object
city_cluster_arr = city_cluster_df[['city_en','lat','lon','cluster','cluster_filtered']].to_records()
descr=city_cluster_arr.dtype.descr
# change the field 'city_en' to string type (the index for 'city_en' here is 1 because before the field is the row index of dataframe)
descr[1]=(descr[1][0], "S20")
newArr=city_cluster_arr.astype(np.dtype(descr))

Answer 13

A simple way to convert dataframe to numpy array:

import pandas as pd
df = pd.DataFrame({"A": [1, 2], "B": [3, 4]})
df_to_array = df.to_numpy()
array([[1, 3],
       [2, 4]])

Use of to_numpy is encouraged to preserve consistency.

Reference: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_numpy.html


Answer 14

Try this:

np.array(df) 

array([['ID', nan, nan, nan],
       ['1', nan, 0.2, nan],
       ['2', nan, nan, 0.5],
       ['3', nan, 0.2, 0.5],
       ['4', 0.1, 0.2, nan],
       ['5', 0.1, 0.2, 0.5],
       ['6', 0.1, nan, 0.5],
       ['7', 0.1, nan, nan]], dtype=object)

Some more information at https://docs.scipy.org/doc/numpy/reference/generated/numpy.array.html. Valid for numpy 1.16.5 and pandas 0.25.2.


Creating an empty Pandas DataFrame, then filling it?

Question: Creating an empty Pandas DataFrame, then filling it?

I’m starting from the pandas DataFrame docs here: http://pandas.pydata.org/pandas-docs/stable/dsintro.html

I’d like to iteratively fill the DataFrame with values in a time series kind of calculation. So basically, I’d like to initialize the DataFrame with columns A, B and timestamp rows, all 0 or all NaN.

I’d then add initial values and go over this data calculating the new row from the row before, say row[A][t] = row[A][t-1]+1 or so.

I’m currently using the code as below, but I feel it’s kind of ugly and there must be a way to do this with a DataFrame directly, or just a better way in general. Note: I’m using Python 2.7.

import datetime as dt
import pandas as pd
import scipy as s

if __name__ == '__main__':
    base = dt.datetime.today().date()
    dates = [ base - dt.timedelta(days=x) for x in range(0,10) ]
    dates.sort()

    valdict = {}
    symbols = ['A','B', 'C']
    for symb in symbols:
        valdict[symb] = pd.Series( s.zeros( len(dates)), dates )

    for thedate in dates:
        if thedate > dates[0]:
            for symb in valdict:
                valdict[symb][thedate] = 1+valdict[symb][thedate - dt.timedelta(days=1)]

    print valdict

Answer 0

Here’s a couple of suggestions:

Use date_range for the index:

import datetime
import pandas as pd
import numpy as np

todays_date = datetime.datetime.now().date()
index = pd.date_range(todays_date-datetime.timedelta(10), periods=10, freq='D')

columns = ['A','B', 'C']

Note: we could create an empty DataFrame (with NaNs) simply by writing:

df_ = pd.DataFrame(index=index, columns=columns)
df_ = df_.fillna(0) # with 0s rather than NaNs

To do these type of calculations for the data, use a numpy array:

data = np.array([np.arange(10)]*3).T

Hence we can create the DataFrame:

In [10]: df = pd.DataFrame(data, index=index, columns=columns)

In [11]: df
Out[11]: 
            A  B  C
2012-11-29  0  0  0
2012-11-30  1  1  1
2012-12-01  2  2  2
2012-12-02  3  3  3
2012-12-03  4  4  4
2012-12-04  5  5  5
2012-12-05  6  6  6
2012-12-06  7  7  7
2012-12-07  8  8  8
2012-12-08  9  9  9

Answer 1

If you simply want to create an empty data frame and fill it with some incoming data frames later, try this:

newDF = pd.DataFrame() #creates a new dataframe that's empty
newDF = newDF.append(oldDF, ignore_index = True) # ignoring index is optional
# try printing some data from newDF
print newDF.head() #again optional 

In this example I am using this pandas doc to create a new data frame and then using append to write to the newDF with data from oldDF.

If I have to keep appending new data into this newDF from more than one oldDF, I just use a for loop to iterate over pandas.DataFrame.append().


Answer 2

The Right Way™ to Create a DataFrame

TLDR; (just read the bold text)

Most answers here will tell you how to create an empty DataFrame and fill it out, but no one will tell you that it is a bad thing to do.

Here is my advice: Wait until you are sure you have all the data you need to work with. Use a list to collect your data, then initialise a DataFrame when you are ready.

data = []
for a, b, c in some_function_that_yields_data():
    data.append([a, b, c])

df = pd.DataFrame(data, columns=['A', 'B', 'C'])

It is always cheaper to append to a list and create a DataFrame in one go than it is to create an empty DataFrame (or one of NaNs) and append to it over and over again. Lists also take up less memory and are a much lighter data structure to work with, append to, and remove from (if needed).

The other advantage of this method is dtypes are automatically inferred (rather than assigning object to all of them).

The last advantage is that a RangeIndex is automatically created for your data, so it is one less thing to worry about (take a look at the poor append and loc methods below, you will see elements in both that require handling the index appropriately).


Things you should NOT do

append or concat inside a loop

Here is the biggest mistake I’ve seen from beginners:

df = pd.DataFrame(columns=['A', 'B', 'C'])
for a, b, c in some_function_that_yields_data():
    df = df.append({'A': a, 'B': b, 'C': c}, ignore_index=True) # yuck
    # or similarly,
    # df = pd.concat([df, pd.Series({'A': i, 'B': b, 'C': c})], ignore_index=True)

Memory is re-allocated for every append or concat operation you have. Couple this with a loop and you have a quadratic complexity operation. From the df.append doc page:

Iteratively appending rows to a DataFrame can be more computationally intensive than a single concatenate. A better solution is to append those rows to a list and then concatenate the list with the original DataFrame all at once.

The other mistake associated with df.append is that users tend to forget append is not an in-place function, so the result must be assigned back. You also have to worry about the dtypes:

df = pd.DataFrame(columns=['A', 'B', 'C'])
df = df.append({'A': 1, 'B': 12.3, 'C': 'xyz'}, ignore_index=True)

df.dtypes
A     object   # yuck!
B    float64
C     object
dtype: object

Dealing with object columns is never a good thing, because pandas cannot vectorize operations on those columns. You will need to do this to fix it:

df.infer_objects().dtypes
A      int64
B    float64
C     object
dtype: object

loc inside a loop

I have also seen loc used to append to a DataFrame that was created empty:

df = pd.DataFrame(columns=['A', 'B', 'C'])
for a, b, c in some_function_that_yields_data():
    df.loc[len(df)] = [a, b, c]

As before, you have not pre-allocated the amount of memory you need each time, so the memory is re-grown each time you create a new row. It’s just as bad as append, and even more ugly.

Empty DataFrame of NaNs

And then, there’s creating a DataFrame of NaNs, and all the caveats associated therewith.

df = pd.DataFrame(columns=['A', 'B', 'C'], index=range(5))
df
     A    B    C
0  NaN  NaN  NaN
1  NaN  NaN  NaN
2  NaN  NaN  NaN
3  NaN  NaN  NaN
4  NaN  NaN  NaN

It creates a DataFrame of object columns, like the others.

df.dtypes
A    object  # you DON'T want this
B    object
C    object
dtype: object

Appending to it still has all the issues of the methods above:

for i, (a, b, c) in enumerate(some_function_that_yields_data()):
    df.iloc[i] = [a, b, c]

The Proof is in the Pudding

Timing these methods is the fastest way to see just how much they differ in terms of their memory and utility.

[Figure: timing comparison of the methods above]

Benchmarking code for reference.
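
The benchmarking code itself is not reproduced here; a rough sketch of what such a timing comparison could look like follows (names are illustrative, and the slow variant is written with concat because DataFrame.append was removed in pandas 2.0):

import pandas as pd

def fill_with_list(n):
    data = []
    for i in range(n):
        data.append([i, i * 2, i * 3])
    return pd.DataFrame(data, columns=['A', 'B', 'C'])

def fill_with_concat(n):
    df = pd.DataFrame(columns=['A', 'B', 'C'])
    for i in range(n):
        row = pd.DataFrame([[i, i * 2, i * 3]], columns=['A', 'B', 'C'])
        df = pd.concat([df, row], ignore_index=True)  # re-allocates the whole frame every pass
    return df

# In IPython:
# %timeit fill_with_list(1000)    # one allocation at the end
# %timeit fill_with_concat(1000)  # grows quadratically with n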


Answer 3

Initialize empty frame with column names

import pandas as pd

col_names =  ['A', 'B', 'C']
my_df  = pd.DataFrame(columns = col_names)
my_df

Add a new record to a frame

my_df.loc[len(my_df)] = [2, 4, 5]

You also might want to pass a dictionary:

my_dic = {'A':2, 'B':4, 'C':5}
my_df.loc[len(my_df)] = my_dic 

Append another frame to your existing frame

col_names =  ['A', 'B', 'C']
my_df2  = pd.DataFrame(columns = col_names)
my_df = my_df.append(my_df2)

Performance considerations

If you are adding rows inside a loop, consider performance issues. For around the first 1000 records my_df.loc performance is better, but it gradually becomes slower as the number of records in the loop increases.

If you plan to do this inside a big loop (say 10M records or so), you are better off using a mixture of these two: fill a temporary dataframe with loc until its size gets to around 1000, then append it to the original dataframe and empty the temporary one, as sketched below. This would boost your performance by around 10 times.
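
A sketch of that buffered approach; the names and the chunk size are illustrative, and concat stands in for append (which was removed in pandas 2.0):

import pandas as pd

col_names = ['A', 'B', 'C']
big_df = pd.DataFrame(columns=col_names)
tmp_df = pd.DataFrame(columns=col_names)

for i in range(10000):
    tmp_df.loc[len(tmp_df)] = [i, i * 2, i * 3]
    if len(tmp_df) >= 1000:                                  # flush the buffer every ~1000 rows
        big_df = pd.concat([big_df, tmp_df], ignore_index=True)
        tmp_df = pd.DataFrame(columns=col_names)

big_df = pd.concat([big_df, tmp_df], ignore_index=True)      # flush the remainder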


Answer 4

Assume a dataframe with 19 rows

index=range(0,19)
index

columns=['A']
test = pd.DataFrame(index=index, columns=columns)

Keeping Column A as a constant

test['A']=10

Keeping column b as a variable given by a loop

for x in range(0,19):
    test.loc[[x], 'b'] = pd.Series([x], index = [x])

You can replace the first x in pd.Series([x], index = [x]) with any value


How to count the NaN values in a column of a pandas DataFrame

Question: How to count the NaN values in a column of a pandas DataFrame

I have data in which I want to find the number of NaN values, so that if it is less than some threshold, I will drop the column. I looked, but wasn't able to find any function for this. There is value_counts, but it would be slow for me, because most of the values are distinct and I want the count of NaN only.


Answer 0

You can use the isna() method (or its alias isnull(), which is also compatible with older pandas versions < 0.21.0) and then sum to count the NaN values. For one column:

In [1]: s = pd.Series([1,2,3, np.nan, np.nan])

In [4]: s.isna().sum()   # or s.isnull().sum() for older pandas versions
Out[4]: 2

For several columns, it also works:

In [5]: df = pd.DataFrame({'a':[1,2,np.nan], 'b':[np.nan,1,np.nan]})

In [6]: df.isna().sum()
Out[6]:
a    1
b    2
dtype: int64
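
To get a single total for the whole frame instead of a per-column breakdown, the sum can simply be chained:

df.isna().sum().sum()   # 3 for the frame above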

Answer 1

You could subtract the total length from the count of non-nan values:

count_nan = len(df) - df.count()

You should time it on your data. For a small Series I got a 3x speed-up in comparison with the isnull solution.


Answer 2

Let's assume df is a pandas DataFrame.

Then,

df.isnull().sum(axis = 0)

This will give number of NaN values in every column.

If you need the number of NaN values in every row:

df.isnull().sum(axis = 1)

Answer 3

Based on the most voted answer we can easily define a function that gives us a dataframe to preview the missing values and the % of missing values in each column:

def missing_values_table(df):
        mis_val = df.isnull().sum()
        mis_val_percent = 100 * df.isnull().sum() / len(df)
        mis_val_table = pd.concat([mis_val, mis_val_percent], axis=1)
        mis_val_table_ren_columns = mis_val_table.rename(
        columns = {0 : 'Missing Values', 1 : '% of Total Values'})
        mis_val_table_ren_columns = mis_val_table_ren_columns[
            mis_val_table_ren_columns.iloc[:,1] != 0].sort_values(
        '% of Total Values', ascending=False).round(1)
        print ("Your selected dataframe has " + str(df.shape[1]) + " columns.\n"      
            "There are " + str(mis_val_table_ren_columns.shape[0]) +
              " columns that have missing values.")
        return mis_val_table_ren_columns
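
For instance, calling it on a small frame (the exact output formatting is sketched by hand, so treat it as illustrative):

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 2, np.nan], 'b': [np.nan, 1, np.nan]})
missing_values_table(df)
# Your selected dataframe has 2 columns.
# There are 2 columns that have missing values.
#    Missing Values  % of Total Values
# b               2               66.7
# a               1               33.3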

Answer 4

Since pandas 0.14.1, my suggestion here to have a keyword argument in the value_counts method has been implemented:

import pandas as pd
df = pd.DataFrame({'a':[1,2,np.nan], 'b':[np.nan,1,np.nan]})
for col in df:
    print df[col].value_counts(dropna=False)

2     1
 1     1
NaN    1
dtype: int64
NaN    2
 1     1
dtype: int64

Answer 5

If it's just counting NaN values in a pandas column, here is a quick way:

import pandas as pd
## df1 as an example data frame 
## col1 name of column for which you want to calculate the nan values
sum(pd.isnull(df1['col1']))

Answer 6

If you are using a Jupyter Notebook, how about:

 %%timeit
 df.isnull().any().any()

or

 %timeit 
 df.isnull().values.sum()

Or, to check whether there are any NaNs in the data at all, and if yes, where:

 df.isnull().any()

Answer 7

The below will print all the NaN columns in descending order.

df.isnull().sum().sort_values(ascending = False)

or

The below will print the first 15 NaN columns in descending order.

df.isnull().sum().sort_values(ascending = False).head(15)

Answer 8

import numpy as np
import pandas as pd

raw_data = {'first_name': ['Jason', np.nan, 'Tina', 'Jake', 'Amy'], 
        'last_name': ['Miller', np.nan, np.nan, 'Milner', 'Cooze'], 
        'age': [22, np.nan, 23, 24, 25], 
        'sex': ['m', np.nan, 'f', 'm', 'f'], 
        'Test1_Score': [4, np.nan, 0, 0, 0],
        'Test2_Score': [25, np.nan, np.nan, 0, 0]}
results = pd.DataFrame(raw_data, columns = ['first_name', 'last_name', 'age', 'sex', 'Test1_Score', 'Test2_Score'])

results 
'''
  first_name last_name   age  sex  Test1_Score  Test2_Score
0      Jason    Miller  22.0    m          4.0         25.0
1        NaN       NaN   NaN  NaN          NaN          NaN
2       Tina       NaN  23.0    f          0.0          NaN
3       Jake    Milner  24.0    m          0.0          0.0
4        Amy     Cooze  25.0    f          0.0          0.0
'''

You can use the following function, which will give you the output in a DataFrame with these columns:

  • Zero Values
  • Missing Values
  • % of Total Values
  • Total Zero Missing Values
  • % Total Zero Missing Values
  • Data Type

Just copy and paste the following function and call it, passing in your pandas DataFrame:

def missing_zero_values_table(df):
        zero_val = (df == 0.00).astype(int).sum(axis=0)
        mis_val = df.isnull().sum()
        mis_val_percent = 100 * df.isnull().sum() / len(df)
        mz_table = pd.concat([zero_val, mis_val, mis_val_percent], axis=1)
        mz_table = mz_table.rename(
        columns = {0 : 'Zero Values', 1 : 'Missing Values', 2 : '% of Total Values'})
        mz_table['Total Zero Missing Values'] = mz_table['Zero Values'] + mz_table['Missing Values']
        mz_table['% Total Zero Missing Values'] = 100 * mz_table['Total Zero Missing Values'] / len(df)
        mz_table['Data Type'] = df.dtypes
        mz_table = mz_table[
            mz_table.iloc[:,1] != 0].sort_values(
        '% of Total Values', ascending=False).round(1)
        print ("Your selected dataframe has " + str(df.shape[1]) + " columns and " + str(df.shape[0]) + " Rows.\n"      
            "There are " + str(mz_table.shape[0]) +
              " columns that have missing values.")
#         mz_table.to_excel('D:/sampledata/missing_and_zero_values.xlsx', freeze_panes=(1,0), index = False)
        return mz_table

missing_zero_values_table(results)

Output

Your selected dataframe has 6 columns and 5 Rows.
There are 6 columns that have missing values.

             Zero Values  Missing Values  % of Total Values  Total Zero Missing Values  % Total Zero Missing Values Data Type
last_name              0               2               40.0                          2                         40.0    object
Test2_Score            2               2               40.0                          4                         80.0   float64
first_name             0               1               20.0                          1                         20.0    object
age                    0               1               20.0                          1                         20.0   float64
sex                    0               1               20.0                          1                         20.0    object
Test1_Score            3               1               20.0                          4                         80.0   float64

If you want to keep it simple, you can use the following function to get missing values as a percentage:

def missing(dff):
    print (round((dff.isnull().sum() * 100/ len(dff)),2).sort_values(ascending=False))


missing(results)
'''
Test2_Score    40.0
last_name      40.0
Test1_Score    20.0
sex            20.0
age            20.0
first_name     20.0
dtype: float64
'''

Answer 9

To count zeroes:

df[df == 0].count(axis=0)

To count NaN:

df.isnull().sum()

or

df.isna().sum()

Answer 10

You can use value_counts method and print values of np.nan

s.value_counts(dropna = False)[np.nan]

Answer 11

Please use below for particular column count

dataframe.columnName.isnull().sum()

Answer 12

df1.isnull().sum()

This will do the trick.


Answer 13

Here is the code for counting Null values column wise :

df.isna().sum()

Answer 14

There is a nice Dzone article from July 2017 which details various ways of summarising NaN values. Check it out here.

The article I have cited provides additional value by: (1) showing a way to count and display NaN counts for every column, so that one can easily decide whether or not to discard those columns, and (2) demonstrating a way to select specifically those rows which have NaNs so that they may be selectively discarded or imputed.

Here’s a quick example to demonstrate the utility of the approach – with only a few columns perhaps its usefulness is not obvious but I found it to be of help for larger data-frames.

import pandas as pd
import numpy as np

# example DataFrame
df = pd.DataFrame({'a':[1,2,np.nan], 'b':[np.nan,1,np.nan]})

# Check whether there are null values in columns
null_columns = df.columns[df.isnull().any()]
print(df[null_columns].isnull().sum())

# One can follow along further per the cited article

Answer 15

One other simple option not suggested yet, to just count NaNs, would be using shape to return the number of rows with NaN:

df[df['col_name'].isnull()]['col_name'].shape

Answer 16

df.isnull().sum() will give the column-wise sum of missing values.

If you want to know the sum of missing values in a particular column, then the following code will work: df.column.isnull().sum()


Answer 17

Based on the answer that was given and some improvements, this is my approach:

def PercentageMissin(Dataset):
    """This function will return the percentage of missing values in a dataset."""
    if isinstance(Dataset, pd.DataFrame):
        adict = {}  # a dictionary with column names as keys and the percentage of missing values in each column as values
        for col in Dataset.columns:
            adict[col] = (np.count_nonzero(Dataset[col].isnull()) * 100) / len(Dataset[col])
        return pd.DataFrame(adict, index=['% of missing'], columns=adict.keys())
    else:
        raise TypeError("can only be used with a pandas DataFrame")

Answer 18

In case you need to get the non-NA (non-None) and NA (None) counts across different groups pulled out by groupby:

gdf = df.groupby(['ColumnToGroupBy'])

def countna(x):
    return (x.isna()).sum()

gdf.agg(['count', countna, 'size'])

This returns the counts of non-NA, NA and total number of entries per group.
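
A small usage sketch (the frame and column names are made up for illustration):

import numpy as np
import pandas as pd

df = pd.DataFrame({'g': ['x', 'x', 'y'], 'v': [1.0, np.nan, np.nan]})
gdf = df.groupby(['g'])

def countna(x):
    return (x.isna()).sum()

gdf.agg(['count', countna, 'size'])
#       v
#   count countna size
# g
# x     1       1    2
# y     0       1    1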


Answer 19

Used the solution proposed by @sushmit in my code.

A possible variation of the same can also be

colNullCnt = []
for z in range(len(df1.columns)):
    colNullCnt.append([df1.columns[z], sum(pd.isnull(df1[df1.columns[z]]))])

The advantage of this is that it returns the result for each of the columns in the df.


Answer 20

import pandas as pd
import numpy as np

# example DataFrame
df = pd.DataFrame({'a':[1,2,np.nan], 'b':[np.nan,1,np.nan]})

# count the NaNs in a column
num_nan_a = df.loc[ (pd.isna(df['a'])) , 'a' ].shape[0]
num_nan_b = df.loc[ (pd.isna(df['b'])) , 'b' ].shape[0]

# summarize the num_nan_b
print(df)
print(' ')
print(f"There are {num_nan_a} NaNs in column a")
print(f"There are {num_nan_b} NaNs in column b")

Gives as output:

     a    b
0  1.0  NaN
1  2.0  1.0
2  NaN  NaN

There are 1 NaNs in column a
There are 2 NaNs in column b

Answer 21

Suppose you want to get the number of missing values (NaN) in a column (series) known as price in a dataframe called reviews.

#import the dataframe
import pandas as pd

reviews = pd.read_csv("../input/wine-reviews/winemag-data-130k-v2.csv", index_col=0)

To get the missing values, with n_missing_prices as the variable, simply do:

n_missing_prices = sum(reviews.price.isnull())
print(n_missing_prices)

sum is the key method here; I was trying to use count before I realized sum is the right method to use in this context.


Answer 22

https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.count.html#pandas.Series.count

pandas.Series.count
Series.count(level=None)[source]

Return number of non-NA/null observations in the Series


Answer 23

For your task you can use pandas.DataFrame.dropna (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dropna.html):

import pandas as pd
import numpy as np

df = pd.DataFrame({'a': [1, 2, 3, 4, np.nan],
                   'b': [1, 2, np.nan, 4, np.nan],
                   'c': [np.nan, 2, np.nan, 4, np.nan]})
df = df.dropna(axis='columns', thresh=3)

print(df)

With the thresh parameter you require a minimum number of non-NaN values for a column to be kept; here thresh=3 keeps only columns with at least 3 non-NaN values.

Code outputs:

     a    b
0  1.0  1.0
1  2.0  2.0
2  3.0  NaN
3  4.0  4.0
4  NaN  NaN

How to replace all NaN values with zeros in a column of a pandas DataFrame

Question: How to replace all NaN values with zeros in a column of a pandas DataFrame

I have a dataframe as below

      itm Date                  Amount 
67    420 2012-09-30 00:00:00   65211
68    421 2012-09-09 00:00:00   29424
69    421 2012-09-16 00:00:00   29877
70    421 2012-09-23 00:00:00   30990
71    421 2012-09-30 00:00:00   61303
72    485 2012-09-09 00:00:00   71781
73    485 2012-09-16 00:00:00     NaN
74    485 2012-09-23 00:00:00   11072
75    485 2012-09-30 00:00:00  113702
76    489 2012-09-09 00:00:00   64731
77    489 2012-09-16 00:00:00     NaN

when I try to .apply a function to the Amount column I get the following error.

ValueError: cannot convert float NaN to integer

I have tried applying a function using .isnan from the math module; I have tried the pandas .replace attribute; I tried the .sparse data attribute from pandas 0.9; I have also tried an if NaN == NaN statement in a function. I have also looked at the article How do I replace NA values with zeros in an R dataframe? whilst looking at some other articles. None of the methods I have tried have worked, or they do not recognise NaN. Any hints or solutions would be appreciated.


Answer 0

I believe DataFrame.fillna() will do this for you.

Link to Docs for a dataframe and for a Series.

Example:

In [7]: df
Out[7]: 
          0         1
0       NaN       NaN
1 -0.494375  0.570994
2       NaN       NaN
3  1.876360 -0.229738
4       NaN       NaN

In [8]: df.fillna(0)
Out[8]: 
          0         1
0  0.000000  0.000000
1 -0.494375  0.570994
2  0.000000  0.000000
3  1.876360 -0.229738
4  0.000000  0.000000

To fill the NaNs in only one column, select just that column. In this case I’m using inplace=True to actually change the contents of df.

In [12]: df[1].fillna(0, inplace=True)
Out[12]: 
0    0.000000
1    0.570994
2    0.000000
3   -0.229738
4    0.000000
Name: 1

In [13]: df
Out[13]: 
          0         1
0       NaN  0.000000
1 -0.494375  0.570994
2       NaN  0.000000
3  1.876360 -0.229738
4       NaN  0.000000

EDIT:

To avoid a SettingWithCopyWarning, use the built-in column-specific functionality:

df.fillna({1:0}, inplace=True)

Answer 1

It is not guaranteed that the slicing returns a view or a copy. You can do

df['column'] = df['column'].fillna(value)
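
A minimal runnable sketch of this assign-back pattern, on a hypothetical single-column dataframe:

import numpy as np
import pandas as pd

df = pd.DataFrame({'column': [1.0, np.nan, 3.0]})
df['column'] = df['column'].fillna(0)  # safe whether the selection was a view or a copy
print(df)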

Answer 2

You could use replace to change NaN to 0:

import pandas as pd
import numpy as np

# for column
df['column'] = df['column'].replace(np.nan, 0)

# for whole dataframe
df = df.replace(np.nan, 0)

# inplace
df.replace(np.nan, 0, inplace=True)

Answer 3

I just wanted to provide a bit of an update/special case since it looks like people still come here. If you’re using a multi-index or otherwise using an index-slicer the inplace=True option may not be enough to update the slice you’ve chosen. For example in a 2×2 level multi-index this will not change any values (as of pandas 0.15):

idx = pd.IndexSlice
df.loc[idx[:,mask_1],idx[mask_2,:]].fillna(value=0,inplace=True)

The “problem” is that the chaining breaks the fillna ability to update the original dataframe. I put “problem” in quotes because there are good reasons for the design decisions that led to not interpreting through these chains in certain situations. Also, this is a complex example (though I really ran into it), but the same may apply to fewer levels of indexes depending on how you slice.

The solution is DataFrame.update:

df.update(df.loc[idx[:,mask_1],idx[[mask_2],:]].fillna(value=0))

It’s one line, reads reasonably well (sort of) and eliminates any unnecessary messing with intermediate variables or loops while allowing you to apply fillna to any multi-level slice you like!

If anybody can find places this doesn’t work please post in the comments, I’ve been messing with it and looking at the source and it seems to solve at least my multi-index slice problems.
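
For concreteness, here is a minimal runnable sketch of the update pattern; the index levels and the slice are invented for illustration and are not from the original answer:

import numpy as np
import pandas as pd

rows = pd.MultiIndex.from_product([['a', 'b'], [1, 2]])
cols = pd.MultiIndex.from_product([['x', 'y'], ['u', 'v']])
df = pd.DataFrame(np.nan, index=rows, columns=cols)

idx = pd.IndexSlice
# fillna on the chained slice alone would not write back; update does
df.update(df.loc[idx[:, 1], idx['x', :]].fillna(value=0))
print(df)  # zeros only where the row's second level is 1 and the column's first level is 'x'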


Answer 4

The below code worked for me.

import pandas

df = pandas.read_csv('somefile.txt')

df = df.fillna(0)

Answer 5

An easy way to fill the missing values:

Filling string columns: when string columns have missing or NaN values.

df['string column name'].fillna(df['string column name'].mode().values[0], inplace = True)

Filling numeric columns: when numeric columns have missing or NaN values.

df['numeric column name'].fillna(df['numeric column name'].mean(), inplace = True)

Filling NaN with zero:

df['column name'].fillna(0, inplace = True)
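
A minimal runnable sketch of all three fills together; the column names and values are invented for illustration, and the inplace pattern mirrors the answer’s own (newer pandas versions may warn and prefer assigning the result back):

import numpy as np
import pandas as pd

df = pd.DataFrame({'city':  ['NY', np.nan, 'NY', 'LA'],
                   'price': [10.0, np.nan, 30.0, 20.0],
                   'views': [5.0, np.nan, 7.0, np.nan]})

df['city'].fillna(df['city'].mode().values[0], inplace=True)  # mode: 'NY'
df['price'].fillna(df['price'].mean(), inplace=True)          # mean: 20.0
df['views'].fillna(0, inplace=True)                           # zero fill
print(df)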

Answer 6

You can also use a dictionary to fill the NaN values of specific columns in the DataFrame, rather than filling the whole DataFrame with a single value.

import pandas as pd

df = pd.read_excel('example.xlsx')
df.fillna( {
        'column1': 'Write your values here',
        'column2': 'Write your values here',
        'column3': 'Write your values here',
        'column4': 'Write your values here',
        .
        .
        .
        'column-n': 'Write your values here'} , inplace=True)

Answer 7

Considering that the particular column Amount in the above table is of integer type, the following would be a solution:

df['Amount'] = df.Amount.fillna(0).astype(int)

Similarly, you can fill it with various data types like float, str and so on.

In particular, I would consider the datatype when comparing various values of the same column.
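
A minimal runnable sketch of this fill-then-cast pattern, on a hypothetical Amount column:

import numpy as np
import pandas as pd

df = pd.DataFrame({'Amount': [65211.0, np.nan, 11072.0]})
df['Amount'] = df['Amount'].fillna(0).astype(int)  # NaN -> 0, then cast float -> int
print(df['Amount'].dtype)  # an integer dtype, e.g. int64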


Answer 8

To replace NA values in pandas:

df['column_name'].fillna(value_to_be_replaced, inplace=True)

If inplace=False (the default), the df (dataframe) is not updated; the modified Series is returned instead.
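
A minimal sketch of the difference, assuming a dataframe df with a column named column_name:

# inplace=False (the default): df is untouched, so assign the result back
df['column_name'] = df['column_name'].fillna(0)

# inplace=True: mutates the underlying data directly, returns None
df['column_name'].fillna(0, inplace=True)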


Answer 9

If you were to convert it to a pandas dataframe, you can also accomplish this by using fillna.

import numpy as np
import pandas as pd

arr = np.array([[1, 2, 3, np.nan]])  # start from a NumPy array
df = pd.DataFrame(arr)
df.fillna(0)

This will return the following:

     0    1    2   3
0  1.0  2.0  3.0 NaN
>>> df.fillna(0)
     0    1    2    3
0  1.0  2.0  3.0  0.0

Answer 10

There are primarily two options. In the case of imputation, i.e. filling missing NaN / np.nan values with numerical replacements across one or more columns:

df['Amount'].fillna(value=0) is sufficient:

From the Documentation:

value : scalar, dict, Series, or DataFrame Value to use to fill holes (e.g. 0), alternately a dict/Series/DataFrame of values specifying which value to use for each index (for a Series) or column (for a DataFrame). (values not in the dict/Series/DataFrame will not be filled). This value cannot be a list.

Which means ‘strings’ or ‘constants’ are no longer permissible to be imputed.

For more specialized imputations, use SimpleImputer():

import numpy as np
from sklearn.impute import SimpleImputer

# strategy='constant' replaces every missing value with fill_value
si = SimpleImputer(strategy='constant', missing_values=np.nan, fill_value='Replacement_Value')
df[['Col-1', 'Col-2']] = si.fit_transform(df[['Col-1', 'Col-2']])
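
Note that fit_transform returns a plain NumPy array rather than a DataFrame, which is why the result is assigned back into the selected columns.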


Answer 11

To replace NaN in different columns in different ways:

replacement = {'column_A': 0, 'column_B': -999, 'column_C': -99999}
df = df.fillna(value=replacement)  # fillna returns a new frame unless inplace=True