Replacing blank values (white space) with NaN in pandas

Question:

I want to find all values in a Pandas dataframe that contain whitespace (any arbitrary amount) and replace those values with NaNs.

Any ideas how this can be improved?

Basically I want to turn this:

                   A    B    C
2000-01-01 -0.532681  foo    0
2000-01-02  1.490752  bar    1
2000-01-03 -1.387326  foo    2
2000-01-04  0.814772  baz     
2000-01-05 -0.222552         4
2000-01-06 -1.176781  qux     

Into this:

                   A     B     C
2000-01-01 -0.532681   foo     0
2000-01-02  1.490752   bar     1
2000-01-03 -1.387326   foo     2
2000-01-04  0.814772   baz   NaN
2000-01-05 -0.222552   NaN     4
2000-01-06 -1.176781   qux   NaN

I’ve managed to do it with the code below, but man is it ugly. It’s not Pythonic and I’m sure it’s not the most efficient use of pandas either. I loop through each column and do boolean replacement against a column mask generated by applying a function that does a regex search of each value, matching on whitespace.

for i in df.columns:
    df[i][df[i].apply(lambda v: bool(re.search(r'^\s*$', str(v))))] = None

It could be optimized a bit by only iterating through fields that could contain empty strings:

if df[i].dtype == np.dtype('object'):

But that’s not much of an improvement.

And finally, this code sets the target strings to None, which works with Pandas’ functions like fillna(), but it would be nice for completeness if I could actually insert a NaN directly instead of None.


Answer 0

I think df.replace() does the job, since pandas 0.13:

import numpy as np
import pandas as pd

df = pd.DataFrame([
    [-0.532681, 'foo', 0],
    [1.490752, 'bar', 1],
    [-1.387326, 'foo', 2],
    [0.814772, 'baz', ' '],
    [-0.222552, '   ', 4],
    [-1.176781, 'qux', '  '],
], columns='A B C'.split(), index=pd.date_range('2000-01-01', '2000-01-06'))

# replace fields that are entirely whitespace (or empty) with NaN
print(df.replace(r'^\s*$', np.nan, regex=True))

Produces:

                   A    B   C
2000-01-01 -0.532681  foo   0
2000-01-02  1.490752  bar   1
2000-01-03 -1.387326  foo   2
2000-01-04  0.814772  baz NaN
2000-01-05 -0.222552  NaN   4
2000-01-06 -1.176781  qux NaN

As Temak pointed out, use df.replace(r'^\s+$', np.nan, regex=True) in case your valid data contains white spaces.
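
The difference between the two patterns is easy to verify. A minimal sketch (the frame and variable names here are illustrative, not from the original post): `^\s*$` also matches the empty string, while `^\s+$` requires at least one whitespace character:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'B': ['foo', '', '   ']})

# r'^\s*$' matches empty strings as well as all-whitespace strings
all_blank = df.replace(r'^\s*$', np.nan, regex=True)

# r'^\s+$' requires at least one whitespace character, so '' survives
spaces_only = df.replace(r'^\s+$', np.nan, regex=True)
```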


Answer 1

If you want to replace empty strings and records containing only spaces, the correct answer is:

df = df.replace(r'^\s*$', np.nan, regex=True)

The accepted answer

df.replace(r'\s+', np.nan, regex=True)

does not replace an empty string! You can try it yourself with the given example, slightly updated:

import numpy as np
import pandas as pd

df = pd.DataFrame([
    [-0.532681, 'foo', 0],
    [1.490752, 'bar', 1],
    [-1.387326, 'fo o', 2],
    [0.814772, 'baz', ' '],
    [-0.222552, '   ', 4],
    [-1.176781, 'qux', ''],
], columns='A B C'.split(), index=pd.date_range('2000-01-01', '2000-01-06'))

Note also that 'fo o' is not replaced with NaN, though it contains a space. Further note that a simple:

df.replace(r'', np.nan)

does not work either; try it out.


Answer 2

How about:

d = d.applymap(lambda x: np.nan if isinstance(x, basestring) and x.isspace() else x)

The applymap function applies a function to every cell of the dataframe.
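
Note that basestring is Python 2 only; on Python 3 the same idea works with str. A runnable sketch with illustrative data (in pandas 2.1+, applymap was renamed to DataFrame.map):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1.0, 2.0], 'B': ['foo', '   ']})

# isspace() is True only for non-empty, all-whitespace strings,
# so an empty string '' would NOT be caught by this variant
out = df.applymap(lambda x: np.nan if isinstance(x, str) and x.isspace() else x)
```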


Answer 3

I would do it like this:

df = df.apply(lambda x: x.str.strip()).replace('', np.nan)

or

df = df.apply(lambda x: x.str.strip() if x.dtype == 'object' else x).replace('', np.nan)

This strips all strings, then replaces the resulting empty strings with np.nan. Note the dtype check in the second variant: .apply passes each whole column as a Series, so an isinstance(x, str) test would never be true there.
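
A self-contained sketch of the strip-then-replace approach, guarding on the column dtype so numeric columns pass through untouched (the data here is illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1.5, -2.0], 'B': ['  foo ', '   ']})

# .apply hands each column to the lambda as a Series, so guard on dtype:
# only object (string) columns get .str.strip()
stripped = df.apply(lambda col: col.str.strip() if col.dtype == 'object' else col)

# stripping turns all-whitespace cells into '', which replace() can then target
out = stripped.replace('', np.nan)
```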


Answer 4

Simplest of all solutions:

df = df.replace(r'^\s+$', np.nan, regex=True)

Answer 5

If you are importing the data from a CSV file, it can be as simple as this:

df = pd.read_csv(file_csv, na_values=' ')

This will create the data frame and also replace blank values with NaN.
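
A runnable sketch using an in-memory CSV (the data is illustrative). Note that na_values=' ' matches a field that is exactly one space; multi-space fields need a list such as na_values=[' ', '  '] or the regex approach above:

```python
import io
import pandas as pd

csv_text = "A,B\n1, \n2,foo\n"

# a field consisting of exactly one space is parsed as NaN at read time
df = pd.read_csv(io.StringIO(csv_text), na_values=' ')
```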


Answer 6

For a very fast and simple solution where you check equality against a single value, you can use the mask method.

df.mask(df == ' ')
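
Keep in mind that an equality check only catches one exact value: a cell holding two spaces would survive df == ' '. A small sketch with illustrative data:

```python
import pandas as pd

df = pd.DataFrame({'B': ['foo', ' ', '  ']})

# mask() sets cells where the condition holds to NaN (the default fill value)
out = df.mask(df == ' ')
```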

Answer 7

These are all close to the right answer, but I wouldn’t say any of them solves the problem while remaining most readable to others reading your code. I’d say that answer is a combination of BrenBarn’s answer and tuomasttik’s comment below that answer. BrenBarn’s answer utilizes the isspace builtin, but does not support removing empty strings, as the OP requested, and I would tend to attribute that as the standard use case of replacing strings with null.

I rewrote it with .apply, so you can call it on a pd.Series or pd.DataFrame.


Python 3:

To replace empty strings or strings of entirely spaces:

df = df.apply(lambda x: np.nan if isinstance(x, str) and (x.isspace() or not x) else x)

To replace strings of entirely spaces:

df = df.apply(lambda x: np.nan if isinstance(x, str) and x.isspace() else x)

To use this in Python 2, you’ll need to replace str with basestring.

Python 2:

To replace empty strings or strings of entirely spaces:

df = df.apply(lambda x: np.nan if isinstance(x, basestring) and (x.isspace() or not x) else x)

To replace strings of entirely spaces:

df = df.apply(lambda x: np.nan if isinstance(x, basestring) and x.isspace() else x)
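
One caveat, sketched below with illustrative data: on a DataFrame, .apply passes whole columns rather than scalars, so the isinstance check never fires there; the lambda works as written on a Series (or per element via applymap):

```python
import numpy as np
import pandas as pd

s = pd.Series(['foo', '', '   ', 'bar'])

# on a Series, .apply receives each scalar, so the isinstance check works:
# '' fails the truthiness test and '   ' passes isspace(), both become NaN
out = s.apply(lambda x: np.nan if isinstance(x, str) and (x.isspace() or not x) else x)
```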

Answer 8

This worked for me. When I import my csv file I added na_values=' '. Spaces are not included in the default NaN values.

df = pd.read_csv(filepath, na_values=' ')


Answer 9

You can also use a filter to do it:

df = pd.DataFrame([
    [-0.532681, 'foo', 0],
    [1.490752, 'bar', 1],
    [-1.387326, 'foo', 2],
    [0.814772, 'baz', ' '],
    [-0.222552, '   ', 4],
    [-1.176781, 'qux', '  '],
], columns='A B C'.split())

df[df.isin([' ', '  ', '   '])] = 'nan'  # mark the blank cells with the string 'nan'
df['C'] = df['C'].astype(float)          # the string 'nan' becomes NaN on the float cast

Answer 10

print(df.isnull().sum())          # count null values in each column

modifiedDf = df.fillna("NaN")     # replace nulls with the string "NaN" (a string, not a real NaN)

# modifiedDf = df.dropna()        # or drop rows containing null values instead

print(modifiedDf.isnull().sum())  # verify: null counts are now zero

Answer 11

This is not an elegant solution, but what does seem to work is saving to XLSX and then importing it back. The other solutions on this page did not work for me; I’m not sure why.

data.to_excel(filepath, index=False)
data = pd.read_excel(filepath)