Question: replace blank values (whitespace) with NaN in pandas
I want to find all values in a Pandas dataframe that contain whitespace (any arbitrary amount) and replace those values with NaNs.
Any ideas how this can be improved?
Basically I want to turn this:
A B C
2000-01-01 -0.532681 foo 0
2000-01-02 1.490752 bar 1
2000-01-03 -1.387326 foo 2
2000-01-04 0.814772 baz
2000-01-05 -0.222552 4
2000-01-06 -1.176781 qux
Into this:
A B C
2000-01-01 -0.532681 foo 0
2000-01-02 1.490752 bar 1
2000-01-03 -1.387326 foo 2
2000-01-04 0.814772 baz NaN
2000-01-05 -0.222552 NaN 4
2000-01-06 -1.176781 qux NaN
I've managed to do it with the code below, but man is it ugly. It's not Pythonic and I'm sure it's not the most efficient use of pandas either. I loop through each column and do boolean replacement against a column mask generated by applying a function that does a regex search of each value, matching on whitespace.
for i in df.columns:
    df[i][df[i].apply(lambda i: True if re.search(r'^\s*$', str(i)) else False)] = None
It could be optimized a bit by only iterating through fields that could contain empty strings:
if df[i].dtype == np.dtype('object')
But that's not much of an improvement.
And finally, this code sets the target strings to None, which works with pandas functions like fillna(), but it would be nice for completeness if I could actually insert a NaN directly instead of None.
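For reference, a runnable sketch of the loop described above, restricted to object columns and assigning np.nan directly instead of None (the sample frame below is made up for illustration, not from the question):

```python
import re

import numpy as np
import pandas as pd

# Hypothetical sample frame, just for illustration.
df = pd.DataFrame({
    'A': [-0.532681, 1.490752],
    'B': ['foo', '   '],
    'C': [0, ' '],
})

for col in df.columns:
    # Only object columns can hold whitespace-only strings.
    if df[col].dtype == np.dtype('object'):
        mask = df[col].apply(lambda v: bool(re.search(r'^\s*$', str(v))))
        # .loc assignment inserts np.nan directly, no None involved.
        df.loc[mask, col] = np.nan

print(df)
```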
Answer 0
I think df.replace() does the job, since pandas 0.13:
df = pd.DataFrame([
    [-0.532681, 'foo', 0],
    [1.490752, 'bar', 1],
    [-1.387326, 'foo', 2],
    [0.814772, 'baz', ' '],
    [-0.222552, ' ', 4],
    [-1.176781, 'qux', ' '],
], columns='A B C'.split(), index=pd.date_range('2000-01-01', '2000-01-06'))
# replace field that's entirely space (or empty) with NaN
print(df.replace(r'^\s*$', np.nan, regex=True))
Produces:
A B C
2000-01-01 -0.532681 foo 0
2000-01-02 1.490752 bar 1
2000-01-03 -1.387326 foo 2
2000-01-04 0.814772 baz NaN
2000-01-05 -0.222552 NaN 4
2000-01-06 -1.176781 qux NaN
As Temak pointed out, use df.replace(r'^\s+$', np.nan, regex=True) in case your valid data contains whitespace.
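A small sketch (my own toy Series, not from the answer) contrasting the two patterns: \s* also catches the empty string, while \s+ requires at least one whitespace character:

```python
import numpy as np
import pandas as pd

s = pd.Series(['foo', '', '   '])

# ^\s*$ matches empty and whitespace-only strings alike.
star = s.replace(r'^\s*$', np.nan, regex=True)
# ^\s+$ needs at least one whitespace character, so '' survives.
plus = s.replace(r'^\s+$', np.nan, regex=True)

print(star.isna().tolist())
print(plus.isna().tolist())
```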
Answer 1
If you want to replace empty strings as well as records containing only spaces, the correct answer is:
df = df.replace(r'^\s*$', np.nan, regex=True)
The accepted answer's
df.replace(r'\s+', np.nan, regex=True)
does not replace an empty string! You can try it yourself with the given example, slightly updated:
df = pd.DataFrame([
    [-0.532681, 'foo', 0],
    [1.490752, 'bar', 1],
    [-1.387326, 'fo o', 2],
    [0.814772, 'baz', ' '],
    [-0.222552, ' ', 4],
    [-1.176781, 'qux', ''],
], columns='A B C'.split(), index=pd.date_range('2000-01-01', '2000-01-06'))
Note also that 'fo o' is not replaced with NaN, though it contains a space. Further note that a simple
df.replace(r'', np.NaN)
does not work either – try it out.
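Running the recommended pattern over the updated example confirms which cells change (a sketch; the anchors require the whole cell to be whitespace, so 'fo o' is untouched):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([
    [-0.532681, 'foo', 0],
    [1.490752, 'bar', 1],
    [-1.387326, 'fo o', 2],
    [0.814772, 'baz', ' '],
    [-0.222552, '   ', 4],
    [-1.176781, 'qux', ''],
], columns='A B C'.split(), index=pd.date_range('2000-01-01', '2000-01-06'))

# Both the all-space cell and the empty-string cell become NaN;
# 'fo o' keeps its interior space.
out = df.replace(r'^\s*$', np.nan, regex=True)
print(out)
```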
Answer 2
How about:
d = d.applymap(lambda x: np.nan if isinstance(x, basestring) and x.isspace() else x)
The applymap function applies a function to every cell of the dataframe.
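Note that basestring only exists in Python 2; a Python 3 sketch of the same idea on my own toy data (applymap was renamed to DataFrame.map in pandas 2.1, which the fallback below accounts for):

```python
import numpy as np
import pandas as pd

d = pd.DataFrame({'A': ['foo', '  '], 'B': [1, 2]})

# In Python 3, check against str. isspace() is True only for non-empty,
# all-whitespace strings, so '' is NOT caught by this approach.
cellwise = d.map if hasattr(d, 'map') else d.applymap  # pandas >= 2.1 vs older
d = cellwise(lambda x: np.nan if isinstance(x, str) and x.isspace() else x)

print(d)
```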
Answer 3
I would do this:
df = df.apply(lambda x: x.str.strip()).replace('', np.nan)
or
df = df.apply(lambda x: x.str.strip() if x.dtype == 'object' else x).replace('', np.nan)
You can strip all the strings, then replace the empty ones with np.nan.
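A self-contained sketch of the strip-then-replace idea (toy data of my own; the per-column string check is done on the dtype, since apply passes whole columns, and note that stripping also trims leading/trailing spaces from otherwise valid values):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [' foo ', '  ', ''], 'B': [1, 2, 3]})

# Strip object columns only, then turn the resulting empty strings into NaN.
df = df.apply(lambda x: x.str.strip() if x.dtype == 'object' else x).replace('', np.nan)

print(df)
```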
Answer 4
Simplest of all solutions:
df = df.replace(r'^\s+$', np.nan, regex=True)
Answer 5
If you are importing the data from a CSV file, it can be as simple as this:
df = pd.read_csv(file_csv, na_values=' ')
This will create the dataframe and replace blank values with NaN.
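A self-contained sketch, with io.StringIO standing in for the file on disk:

```python
import io

import pandas as pd

csv_text = "A,B\nfoo, \nbar,1\n"

# na_values=' ' adds the single-space field to the default set of NaN markers.
df = pd.read_csv(io.StringIO(csv_text), na_values=' ')

print(df)
```

Note this only matches fields that are exactly one space; arbitrary runs of whitespace would still need a post-hoc replace.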
Answer 6
For a very fast and simple solution where you check equality against a single value, you can use the mask method.
df.mask(df == ' ')
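A sketch with made-up data: mask replaces cells where the condition holds with NaN (the default other), and comparing a numeric column to ' ' is simply False everywhere:

```python
import pandas as pd

df = pd.DataFrame({'A': ['foo', ' '], 'B': [0, 1]})

# Cells equal to a single space become NaN; everything else is kept.
out = df.mask(df == ' ')

print(out)
```

As the answer says, this handles equality against one value (here exactly one space), not arbitrary whitespace.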
Answer 7
These are all close to the right answer, but I wouldn't say any of them solve the problem while remaining most readable to others reading your code. I'd say that answer is a combination of BrenBarn's answer and tuomasttik's comment below that answer. BrenBarn's answer utilizes the isspace builtin, but does not support removing empty strings, as the OP requested, and I would tend to attribute that as the standard use case of replacing strings with null.
I rewrote it with .apply, so you can call it on a pd.Series or a pd.DataFrame.
Python 3:
To replace empty strings or strings of entirely spaces:
df = df.apply(lambda x: np.nan if isinstance(x, str) and (x.isspace() or not x) else x)
To replace strings of entirely spaces:
df = df.apply(lambda x: np.nan if isinstance(x, str) and x.isspace() else x)
To use this in Python 2, you'll need to replace str with basestring.
Python 2:
To replace empty strings or strings of entirely spaces:
df = df.apply(lambda x: np.nan if isinstance(x, basestring) and (x.isspace() or not x) else x)
To replace strings of entirely spaces:
df = df.apply(lambda x: np.nan if isinstance(x, basestring) and x.isspace() else x)
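On a single column this works elementwise, since Series.apply calls the function once per cell; on a whole DataFrame, df.apply passes entire columns, so a cell-wise applymap/map would be the equivalent there. A Python 3 sketch on a toy Series:

```python
import numpy as np
import pandas as pd

s = pd.Series(['foo', '  ', ''])

# Empty strings ('not x') and whitespace-only strings (isspace) both become NaN.
out = s.apply(lambda x: np.nan if isinstance(x, str) and (x.isspace() or not x) else x)

print(out)
```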
Answer 8
This worked for me. When I imported my csv file I added na_values=' '. Spaces are not included in the default NaN values.
df = pd.read_csv(filepath, na_values=' ')
Answer 9
You can also use a boolean filter to do it.
df = pd.DataFrame([
    [-0.532681, 'foo', 0],
    [1.490752, 'bar', 1],
    [-1.387326, 'foo', 2],
    [0.814772, 'baz', ' '],
    [-0.222552, ' ', 4],
    [-1.176781, 'qux', ' '],
])
df[df == ' '] = np.nan  # assign NaN directly; astype(float) would fail on the string columns
Answer 10
print(df.isnull().sum())          # check the number of null values in each column
modifiedDf = df.fillna("NaN")     # replace empty/null values with the string "NaN"
# modifiedDf = df.dropna()        # or remove rows with empty values
print(modifiedDf.isnull().sum())  # check the number of null values in each column
Answer 11
This is not an elegant solution, but what does seem to work is saving to XLSX and then importing it back. The other solutions on this page did not work for me; I'm not sure why.
data.to_excel(filepath, index=False)
data = pd.read_excel(filepath)