问题:如何用熊猫DataFrame中的先前值替换NaN?

假设我有一个带有NaNs 的DataFrame :

>>> import pandas as pd
>>> df = pd.DataFrame([[1, 2, 3], [4, None, None], [None, None, 9]])
>>> df
    0   1   2
0   1   2   3
1   4 NaN NaN
2 NaN NaN   9

我需要做的是用上面同一列中NaN的第一个非NaN值替换每个值。假设第一行永远不会包含NaN。因此,对于前面的示例,结果将是

   0  1  2
0  1  2  3
1  4  2  3
2  4  2  9

我可以遍历整个DataFrame的逐列,逐元素并直接设置值,但是是否有一种简单的方法(最佳无循环)来实现呢?

Suppose I have a DataFrame with some NaNs:

>>> import pandas as pd
>>> df = pd.DataFrame([[1, 2, 3], [4, None, None], [None, None, 9]])
>>> df
    0   1   2
0   1   2   3
1   4 NaN NaN
2 NaN NaN   9

What I need to do is replace every NaN with the first non-NaN value in the same column above it. It is assumed that the first row will never contain a NaN. So for the previous example the result would be

   0  1  2
0  1  2  3
1  4  2  3
2  4  2  9

I can just loop through the whole DataFrame column-by-column, element-by-element and set the values directly, but is there an easy (optimally a loop-free) way of achieving this?


回答 0

您可以在DataFrame上使用该方法,并将该方法指定为ffill(正向填充):

>>> df = pd.DataFrame([[1, 2, 3], [4, None, None], [None, None, 9]])
>>> df.fillna(method='ffill')
   0  1  2
0  1  2  3
1  4  2  3
2  4  2  9

这个方法

将上一个有效观察结果传播到下一个有效观察结果

相反,还有一个 bfill方法。

此方法不会就地修改DataFrame-您需要将返回的DataFrame重新绑定到变量,或者指定inplace=True

df.fillna(method='ffill', inplace=True)

You could use the method on the DataFrame and specify the method as ffill (forward fill):

>>> df = pd.DataFrame([[1, 2, 3], [4, None, None], [None, None, 9]])
>>> df.fillna(method='ffill')
   0  1  2
0  1  2  3
1  4  2  3
2  4  2  9

This method…

propagate[s] last valid observation forward to next valid

To go the opposite way, there’s also a bfill method.

This method doesn’t modify the DataFrame inplace – you’ll need to rebind the returned DataFrame to a variable or else specify inplace=True:

df.fillna(method='ffill', inplace=True)

回答 1

公认的答案是完美的。我遇到了一个相关但略有不同的情况,我必须向前填写,但只能在小组中填写。如果有人有相同的需求,请知道fillna可用于DataFrameGroupBy对象。

>>> example = pd.DataFrame({'number':[0,1,2,nan,4,nan,6,7,8,9],'name':list('aaabbbcccc')})
>>> example
  name  number
0    a     0.0
1    a     1.0
2    a     2.0
3    b     NaN
4    b     4.0
5    b     NaN
6    c     6.0
7    c     7.0
8    c     8.0
9    c     9.0
>>> example.groupby('name')['number'].fillna(method='ffill') # fill in row 5 but not row 3
0    0.0
1    1.0
2    2.0
3    NaN
4    4.0
5    4.0
6    6.0
7    7.0
8    8.0
9    9.0
Name: number, dtype: float64

The accepted answer is perfect. I had a related but slightly different situation where I had to fill in forward but only within groups. In case someone has the same need, know that fillna works on a DataFrameGroupBy object.

>>> example = pd.DataFrame({'number':[0,1,2,nan,4,nan,6,7,8,9],'name':list('aaabbbcccc')})
>>> example
  name  number
0    a     0.0
1    a     1.0
2    a     2.0
3    b     NaN
4    b     4.0
5    b     NaN
6    c     6.0
7    c     7.0
8    c     8.0
9    c     9.0
>>> example.groupby('name')['number'].fillna(method='ffill') # fill in row 5 but not row 3
0    0.0
1    1.0
2    2.0
3    NaN
4    4.0
5    4.0
6    6.0
7    7.0
8    8.0
9    9.0
Name: number, dtype: float64

回答 2

您可以使用method='ffill'选项。'ffill'代表“向前填充”,并将向前传播最后一个有效观察值。替代方法是'bfill'相同的方法,但倒退。

import pandas as pd

df = pd.DataFrame([[1, 2, 3], [4, None, None], [None, None, 9]])
df = df.fillna(method='ffill')

print(df)
#   0  1  2
#0  1  2  3
#1  4  2  3
#2  4  2  9

为此,还有一个直接的同义词功能pandas.DataFrame.ffill,可以简化操作。

You can use with the method='ffill' option. 'ffill' stands for ‘forward fill’ and will propagate last valid observation forward. The alternative is 'bfill' which works the same way, but backwards.

import pandas as pd

df = pd.DataFrame([[1, 2, 3], [4, None, None], [None, None, 9]])
df = df.fillna(method='ffill')

print(df)
#   0  1  2
#0  1  2  3
#1  4  2  3
#2  4  2  9

There is also a direct synonym function for this, pandas.DataFrame.ffill, to make things simpler.


回答 3

我在尝试此解决方案时注意到的一件事是,如果您在数组的开头或结尾处都没有N / A,则填充和填充将无法正常工作。你们两个都需要。

In [224]: df = pd.DataFrame([None, 1, 2, 3, None, 4, 5, 6, None])

In [225]: df.ffill()
Out[225]:
     0
0  NaN
1  1.0
...
7  6.0
8  6.0

In [226]: df.bfill()
Out[226]:
     0
0  1.0
1  1.0
...
7  6.0
8  NaN

In [227]: df.bfill().ffill()
Out[227]:
     0
0  1.0
1  1.0
...
7  6.0
8  6.0

One thing that I noticed when trying this solution is that if you have N/A at the start or the end of the array, ffill and bfill don’t quite work. You need both.

In [224]: df = pd.DataFrame([None, 1, 2, 3, None, 4, 5, 6, None])

In [225]: df.ffill()
Out[225]:
     0
0  NaN
1  1.0
...
7  6.0
8  6.0

In [226]: df.bfill()
Out[226]:
     0
0  1.0
1  1.0
...
7  6.0
8  NaN

In [227]: df.bfill().ffill()
Out[227]:
     0
0  1.0
1  1.0
...
7  6.0
8  6.0

回答 4

ffill 现在有自己的方法 pd.DataFrame.ffill

df.ffill()

     0    1    2
0  1.0  2.0  3.0
1  4.0  2.0  3.0
2  4.0  2.0  9.0

ffill now has it’s own method pd.DataFrame.ffill

df.ffill()

     0    1    2
0  1.0  2.0  3.0
1  4.0  2.0  3.0
2  4.0  2.0  9.0

回答 5

仅一列版本

  • 最后一个有效值填充NAN
df[column_name].fillna(method='ffill', inplace=True)
  • 下一个有效值填充NAN
df[column_name].fillna(method='backfill', inplace=True)

Only one column version

  • Fill NAN with last valid value
df[column_name].fillna(method='ffill', inplace=True)
  • Fill NAN with next valid value
df[column_name].fillna(method='backfill', inplace=True)

回答 6

只是同意ffillmethod,但是一个额外的信息是您可以使用关键字arguments限制正向填充limit

>>> import pandas as pd    
>>> df = pd.DataFrame([[1, 2, 3], [None, None, 6], [None, None, 9]])

>>> df
     0    1   2
0  1.0  2.0   3
1  NaN  NaN   6
2  NaN  NaN   9

>>> df[1].fillna(method='ffill', inplace=True)
>>> df
     0    1    2
0  1.0  2.0    3
1  NaN  2.0    6
2  NaN  2.0    9

现在带有limit关键字参数

>>> df[0].fillna(method='ffill', limit=1, inplace=True)

>>> df
     0    1  2
0  1.0  2.0  3
1  1.0  2.0  6
2  NaN  2.0  9

Just agreeing with ffill method, but one extra info is that you can limit the forward fill with keyword argument limit.

>>> import pandas as pd    
>>> df = pd.DataFrame([[1, 2, 3], [None, None, 6], [None, None, 9]])

>>> df
     0    1   2
0  1.0  2.0   3
1  NaN  NaN   6
2  NaN  NaN   9

>>> df[1].fillna(method='ffill', inplace=True)
>>> df
     0    1    2
0  1.0  2.0    3
1  NaN  2.0    6
2  NaN  2.0    9

Now with limit keyword argument

>>> df[0].fillna(method='ffill', limit=1, inplace=True)

>>> df
     0    1  2
0  1.0  2.0  3
1  1.0  2.0  6
2  NaN  2.0  9

回答 7

就我而言,我们有来自不同设备的时间序列,但是某些设备在一段时间内无法发送任何值。因此,我们应该为每个设备和时间段创建NA值,然后再执行fillna。

df = pd.DataFrame([["device1", 1, 'first val of device1'], ["device2", 2, 'first val of device2'], ["device3", 3, 'first val of device3']])
df.pivot(index=1, columns=0, values=2).fillna(method='ffill').unstack().reset_index(name='value')

结果:

        0   1   value
0   device1     1   first val of device1
1   device1     2   first val of device1
2   device1     3   first val of device1
3   device2     1   None
4   device2     2   first val of device2
5   device2     3   first val of device2
6   device3     1   None
7   device3     2   None
8   device3     3   first val of device3

In my case, we have time series from different devices but some devices could not send any value during some period. So we should create NA values for every device and time period and after that do fillna.

df = pd.DataFrame([["device1", 1, 'first val of device1'], ["device2", 2, 'first val of device2'], ["device3", 3, 'first val of device3']])
df.pivot(index=1, columns=0, values=2).fillna(method='ffill').unstack().reset_index(name='value')

Result:

        0   1   value
0   device1     1   first val of device1
1   device1     2   first val of device1
2   device1     3   first val of device1
3   device2     1   None
4   device2     2   first val of device2
5   device2     3   first val of device2
6   device3     1   None
7   device3     2   None
8   device3     3   first val of device3

回答 8

您可以fillna用来删除或替换NaN值。

NaN 移除

import pandas as pd

df = pd.DataFrame([[1, 2, 3], [4, None, None], [None, None, 9]])

df.fillna(method='ffill')
     0    1    2
0  1.0  2.0  3.0
1  4.0  2.0  3.0
2  4.0  2.0  9.0

NaN 替换

df.fillna(0) # 0 means What Value you want to replace 
     0    1    2
0  1.0  2.0  3.0
1  4.0  0.0  0.0
2  0.0  0.0  9.0

参考pandas.DataFrame.fillna

You can use fillna to remove or replace NaN values.

NaN Remove

import pandas as pd

df = pd.DataFrame([[1, 2, 3], [4, None, None], [None, None, 9]])

df.fillna(method='ffill')
     0    1    2
0  1.0  2.0  3.0
1  4.0  2.0  3.0
2  4.0  2.0  9.0

NaN Replace

df.fillna(0) # 0 means What Value you want to replace 
     0    1    2
0  1.0  2.0  3.0
1  4.0  0.0  0.0
2  0.0  0.0  9.0

Reference pandas.DataFrame.fillna


声明:本站所有文章,如无特殊说明或标注,均为本站原创发布。任何个人或组织,在未征得本站同意时,禁止复制、盗用、采集、发布本站内容到任何网站、书籍等各类媒体平台。如若本站内容侵犯了原著者的合法权益,可联系我们进行处理。