How to set a cell to NaN in a pandas dataframe

Question: How to set a cell to NaN in a pandas dataframe

I'd like to replace the bad values in a dataframe column with NaN.

import numpy as np
import pandas as pd

mydata = {'x': [10, 50, 18, 32, 47, 20], 'y': ['12', '11', 'N/A', '13', '15', 'N/A']}
df = pd.DataFrame(mydata)

df[df.y == 'N/A']['y'] = np.nan

However, the last line fails and raises a warning because it operates on a copy of df. So what is the correct way to handle this? I've seen many solutions using iloc or ix, but here I need to use a boolean condition.


Answer 0

Just use replace:

In [106]:
df.replace('N/A', np.nan)

Out[106]:
    x    y
0  10   12
1  50   11
2  18  NaN
3  32   13
4  47   15
5  20  NaN

What you are attempting is called chained indexing: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy

You can use loc to make sure you operate on the original df:

In [108]:
df.loc[df['y'] == 'N/A','y'] = np.nan
df

Out[108]:
    x    y
0  10   12
1  50   11
2  18  NaN
3  32   13
4  47   15
5  20  NaN
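
To make the difference between the two approaches concrete, here is a minimal sketch using the data from the question. The variable names replaced, untouched_count and nan_count are only illustrative. It shows that replace hands back a new dataframe, while a single loc call writes into df itself.

```python
import numpy as np
import pandas as pd

mydata = {'x': [10, 50, 18, 32, 47, 20],
          'y': ['12', '11', 'N/A', '13', '15', 'N/A']}
df = pd.DataFrame(mydata)

# replace returns a new DataFrame; df itself is left untouched
replaced = df.replace('N/A', np.nan)
untouched_count = int((df['y'] == 'N/A').sum())   # df still holds both 'N/A' strings

# one .loc call selects the rows and the column together, then assigns in place
df.loc[df['y'] == 'N/A', 'y'] = np.nan
nan_count = int(df['y'].isna().sum())
```

If you want to keep the result of replace, assign it back (df = df.replace(...)); otherwise the loc form is the direct fix for the line in the question.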

Answer 1

While replace seems to solve the problem, I would like to propose an alternative. When a column mixes numeric values with a few strings, the real fix is not to swap the strings for np.nan but to give the whole column a proper type. I would bet that the original column is most likely of object type:

Name: y, dtype: object

What you really need is to make it a numeric column (it will have a proper dtype and operations on it will be considerably faster), with all non-numeric values replaced by NaN.

So a good conversion would be:

pd.to_numeric(df['y'], errors='coerce')

Specifying errors='coerce' forces any string that can't be parsed as a number to become NaN. The column dtype would then be:

Name: y, dtype: float64
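
As a rough sketch of the conversion above applied to the question's data: pd.to_numeric returns a new Series, so the result has to be assigned back to the column. The names is_numeric and y_mean are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({'x': [10, 50, 18, 32, 47, 20],
                   'y': ['12', '11', 'N/A', '13', '15', 'N/A']})

# to_numeric does not modify df; assign the converted Series back to the column
df['y'] = pd.to_numeric(df['y'], errors='coerce')

# the column is now numeric, so numeric operations work and skip NaN by default
is_numeric = pd.api.types.is_numeric_dtype(df['y'])
y_mean = df['y'].mean()
```

Once the column is numeric, aggregations such as mean work directly, which would not be the case with the original strings.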

Answer 2

You can use replace:

df['y'] = df['y'].replace({'N/A': np.nan})

Also be aware of the inplace parameter for replace. You can do something like:

df.replace({'N/A': np.nan}, inplace=True)

This replaces every instance in df and modifies df itself instead of returning a new dataframe.

Similarly, if you run into other kinds of unknown values, such as empty strings or None:

df['y'] = df['y'].replace({'': np.nan})

df['y'] = df['y'].replace({None: np.nan})

Reference: Pandas Latest – Replace
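
If several placeholder strings occur in the same column, one possible shortcut is to pass them to replace as a list. The sketch below uses a made-up column and a hypothetical bad_values list; neither comes from the question.

```python
import numpy as np
import pandas as pd

# made-up column mixing good values with several kinds of "missing" markers
df = pd.DataFrame({'y': ['12', 'N/A', '', '13', None]})

bad_values = ['N/A', '']   # hypothetical list of placeholder strings
df['y'] = df['y'].replace(bad_values, np.nan)

# None already counts as missing for isna, so it needs no extra replace here
missing_count = int(df['y'].isna().sum())
kept = df['y'].dropna().tolist()
```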


Answer 3

df.loc[df.y == 'N/A',['y']] = np.nan

This solves your problem. With the double [], you are working on a copy of the DataFrame. You have to specify the exact location in a single call to be able to modify it.


Answer 4

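You can try these snippets. Be aware that df.y[...] = ... is itself a form of chained assignment: it can trigger the same warning, and with pandas Copy-on-Write enabled (the default from pandas 3.0) it leaves df unchanged, so df.loc is the more robust choice.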

In [16]: mydata = {'x': [10, 50, 18, 32, 47, 20], 'y': ['12', '11', 'N/A', '13', '15', 'N/A']}
In [17]: df = pd.DataFrame(mydata)

In [18]: df.y[df.y == "N/A"] = np.nan

In [19]: df
Out[19]:
    x    y
0  10   12
1  50   11
2  18  NaN
3  32   13
4  47   15
5  20  NaN
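
As an alternative that avoids indexing the column twice, one option is Series.mask, which is a different technique from the one in this answer. It returns a new Series with NaN wherever the condition is True. A minimal sketch with the question's data (the name remaining is only illustrative):

```python
import pandas as pd

df = pd.DataFrame({'x': [10, 50, 18, 32, 47, 20],
                   'y': ['12', '11', 'N/A', '13', '15', 'N/A']})

# mask returns a new Series with NaN where the condition holds; assign it back
df['y'] = df['y'].mask(df['y'] == 'N/A')

remaining = df['y'].dropna().tolist()
```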

Answer 5

As of pandas 1.0.0, you no longer need numpy to create null values in your dataframe. You can use pandas.NA instead (its type is pandas._libs.missing.NAType). It is treated as null inside the dataframe, but outside a dataframe context it is its own object rather than None or a float NaN.
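
A short sketch of what that might look like with the question's data, assuming pandas 1.0 or later. The names is_null_in_frame and is_none are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({'x': [10, 50, 18],
                   'y': ['12', '11', 'N/A']})

# assign pandas' own missing-value marker through a single .loc call
df.loc[df['y'] == 'N/A', 'y'] = pd.NA

# inside the dataframe the cell is reported as missing
is_null_in_frame = bool(df['y'].isna().iloc[2])

# pd.NA is a distinct object, not Python's None
is_none = pd.NA is None   # False
```

Checking for missing values with isna (or pd.isna) rather than comparing against None or np.nan keeps the code independent of which marker the column ends up storing.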