Question: sklearn error ValueError: Input contains NaN, infinity or a value too large for dtype('float64')
I am using sklearn and having a problem with the affinity propagation. I have built an input matrix and I keep getting the following error.
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
I have run
np.isnan(mat.any()) #and gets False
np.isfinite(mat.all()) #and gets True
I tried using
mat[np.isfinite(mat) == True] = 0
to remove the infinite values but this did not work either.
What can I do to get rid of the infinite values in my matrix, so that I can use the affinity propagation algorithm?
I am using Anaconda and Python 2.7.9.
Answer 0
This might happen inside scikit-learn, depending on what you are doing. I recommend reading the documentation for the functions you are using. You might be using one that requires, for example, a positive definite matrix, and your matrix may not meet that criterion.
EDIT: How could I miss that:
np.isnan(mat.any()) #and gets False
np.isfinite(mat.all()) #and gets True
is obviously wrong. The correct checks are:
np.any(np.isnan(mat))
and
np.all(np.isfinite(mat))
You want to check whether any element is NaN, not whether the return value of the any function is a number.
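The difference between the two forms is easy to see on a small array. The following is a minimal sketch, assuming a NumPy float array; `mat` is a made-up example, not the asker's data. It also shows that the mask from the question, `np.isfinite(mat) == True`, selects the finite entries, so it has to be inverted to overwrite the non-finite ones.

```python
import numpy as np

# Hypothetical example matrix containing one NaN and one infinity.
mat = np.array([[1.0, np.nan], [np.inf, 4.0]])

# Wrong: mat.any() / mat.all() collapse the array to a single boolean first,
# so isnan / isfinite only ever see True or False.
wrong_nan = np.isnan(mat.any())
wrong_finite = np.isfinite(mat.all())

# Right: test element-wise, then reduce.
has_nan = np.any(np.isnan(mat))
all_finite = np.all(np.isfinite(mat))

# Overwrite only the non-finite entries (note the ~ that inverts the mask).
cleaned = mat.copy()
cleaned[~np.isfinite(cleaned)] = 0

print(wrong_nan, wrong_finite, has_nan, all_finite)
print(cleaned)
```

Whether 0 is a sensible replacement value depends on the data; it is used here only because the question used it.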
Answer 1
I got the same error message when using sklearn with pandas. My solution is to reset the index of my dataframe df before running any sklearn code:
df = df.reset_index()
I encountered this issue many times when I removed some entries from my df, such as
df = df[df.label=='desired_one']
Answer 2
This is my function (based on this) to clean a dataset of NaN, Inf, and missing cells (for skewed datasets):
import numpy as np
import pandas as pd

def clean_dataset(df):
    assert isinstance(df, pd.DataFrame), "df needs to be a pd.DataFrame"
    # drop rows with missing cells (this modifies the caller's df in place)
    df.dropna(inplace=True)
    # keep only rows that contain no NaN or infinite value
    indices_to_keep = ~df.isin([np.nan, np.inf, -np.inf]).any(axis=1)
    return df[indices_to_keep].astype(np.float64)
Answer 3
The dimensions of my input array were skewed because my input CSV contained empty spaces.
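One way to confirm this kind of problem is to count the missing values pandas produces when it reads the file. This is a minimal sketch, assuming the data is loaded with `pd.read_csv`; the CSV text and column names are invented for illustration.

```python
import io

import pandas as pd

# Hypothetical CSV content with one empty cell in column "b".
csv_text = "a,b\n1,2\n3,\n5,6\n"
df = pd.read_csv(io.StringIO(csv_text))

# Empty cells are read as NaN; a non-zero count here means
# scikit-learn's finiteness check would reject the data.
missing_per_column = df.isna().sum()
print(missing_per_column)

# One option among several: drop the incomplete rows before fitting.
df_complete = df.dropna()
print(df_complete.shape)
```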
Answer 4
This is the check that fails, which says:
def _assert_all_finite(X):
    """Like assert_all_finite, but only for ndarray."""
    X = np.asanyarray(X)
    # First try an O(n) time, O(1) space solution for the common case that
    # everything is finite; fall back to O(n) space np.isfinite to prevent
    # false positives from overflow in sum method.
    if (X.dtype.char in np.typecodes['AllFloat'] and not np.isfinite(X.sum())
            and not np.isfinite(X).all()):
        raise ValueError("Input contains NaN, infinity"
                         " or a value too large for %r." % X.dtype)
So make sure that your input contains no NaN values, that all values are actually floats, and that none of them is Inf.
Answer 5
With this version of Python 3:
/opt/anaconda3/bin/python --version
Python 3.6.0 :: Anaconda 4.3.0 (64-bit)
Looking at the details of the error, I found the lines of code causing the failure:
/opt/anaconda3/lib/python3.6/site-packages/sklearn/utils/validation.py in _assert_all_finite(X)
     56             and not np.isfinite(X).all()):
     57         raise ValueError("Input contains NaN, infinity"
---> 58                          " or a value too large for %r." % X.dtype)
     59
     60
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
From this, I was able to work out the correct way to inspect my data, using the same test that the failing check applies: np.isfinite(X)
Then, with a quick and dirty loop, I found that my data does indeed contain NaNs:
print(p[:,0].shape)
index = 0
for i in p[:,0]:
    if not np.isfinite(i):
        print(index, i)
    index +=1
(367340,)
4454 nan
6940 nan
10868 nan
12753 nan
14855 nan
15678 nan
24954 nan
30251 nan
31108 nan
51455 nan
59055 nan
...
Now all I have to do is remove the values at these indexes.
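The same search and the removal step can be done without an explicit loop. This is a minimal sketch, assuming the column is a NumPy float array; `col` is a small invented stand-in for `p[:,0]`.

```python
import numpy as np

# Hypothetical stand-in for one column of the data (p[:,0] in the answer).
col = np.array([0.5, np.nan, 2.0, np.inf, 3.5])

# Positions of every value that is NaN or infinite.
bad_idx = np.flatnonzero(~np.isfinite(col))
print(bad_idx)

# Keep only the finite values.
col_clean = col[np.isfinite(col)]
print(col_clean)
```

For a 2-D array, `p[np.isfinite(p).all(axis=1)]` keeps only the rows in which every value is finite.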
Answer 6
I had the error after trying to select a subset of rows:
df = df.reindex(index=my_index)
It turned out that my_index contained values that were not in df.index, so the reindex function inserted some new rows and filled them with NaN.
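The effect described here can be reproduced on a toy frame. This is a minimal sketch; the frame, the labels, and the choice of `Index.intersection` as a remedy are illustrative assumptions rather than the answerer's actual code.

```python
import pandas as pd

df = pd.DataFrame({"x": [1.0, 2.0, 3.0]}, index=["a", "b", "c"])
my_index = ["a", "c", "d"]  # "d" does not exist in df.index

# reindex creates a row for "d" and fills it with NaN.
reindexed = df.reindex(index=my_index)
n_missing = int(reindexed["x"].isna().sum())
print(reindexed)

# One possible remedy: select only the labels that actually exist.
valid_index = df.index.intersection(my_index)
subset = df.loc[valid_index]
print(subset)
```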
Answer 7
In most cases, getting rid of infinite and null values solves this problem.
Get rid of infinite values:
df.replace([np.inf, -np.inf], np.nan, inplace=True)
Get rid of null values in whatever way you prefer: fill them with a specific value such as 999, with the mean, or write your own function to impute missing values:
df.fillna(999, inplace=True)
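As an illustration of the mean option mentioned above, the following minimal sketch fills each column's gaps with that column's mean. The frame and column names are made up, and whether mean imputation suits a given dataset is a modelling decision this example does not settle.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one infinity and one NaN.
df = pd.DataFrame({"a": [1.0, np.inf, 3.0], "b": [np.nan, 4.0, 8.0]})

# Turn infinities into NaN so both kinds of bad value are handled together.
df = df.replace([np.inf, -np.inf], np.nan)

# df.mean() skips NaN by default; fillna matches the means to columns by name.
df = df.fillna(df.mean())
print(df)
```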
Answer 8
I had the same error, and in my case X and y were dataframes, so I had to convert them to matrices first:
X = X.values.astype(np.float64)
y = y.values.astype(np.float64)
Edit: The originally suggested X.as_matrix() is deprecated.
Answer 9
I got the same error. It worked once I called df.fillna(-99999, inplace=True) before doing any replacement, substitution, etc.
Answer 10
In my case the problem was that many scikit-learn functions return NumPy arrays, which carry no pandas index. So there was an index mismatch when I used those NumPy arrays to build new DataFrames and then tried to combine them with the original data.
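The mismatch can be reproduced in a few lines. This is a minimal sketch; `scaled` is an invented stand-in for an array returned by a scikit-learn transformer, and the column names are hypothetical.

```python
import numpy as np
import pandas as pd

# Original frame after filtering: the index is no longer 0..n-1.
df = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0]})
df = df[df.x > 2]  # remaining index labels are 2 and 3

# Stand-in for a plain NumPy array returned by a scikit-learn function.
scaled = np.array([0.0, 1.0])

# A Series built from the array gets a fresh index 0..1. pandas aligns on
# labels, finds no match for labels 2 and 3, and fills the column with NaN.
misaligned = df.assign(scaled=pd.Series(scaled))

# Reusing the original index keeps the rows lined up.
aligned = df.assign(scaled=pd.Series(scaled, index=df.index))

print(misaligned)
print(aligned)
```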
Answer 11
Remove all infinite values (and replace them with the min or max for that column):
# find min and max values for each column, ignoring nan, -inf, and inf
mins = [np.nanmin(matrix[:, i][matrix[:, i] != -np.inf]) for i in range(matrix.shape[1])]
maxs = [np.nanmax(matrix[:, i][matrix[:, i] != np.inf]) for i in range(matrix.shape[1])]
# go through matrix one column at a time and replace + and -infinity
# with the max or min for that column
for i in range(matrix.shape[1]):
    matrix[:, i][matrix[:, i] == -np.inf] = mins[i]
    matrix[:, i][matrix[:, i] == np.inf] = maxs[i]
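The same replacement can be written without a Python loop. This is a minimal sketch of a vectorized variant, not the answerer's code; it assumes a 2-D float array in which every column has at least one finite value, and the sample `matrix` is invented. Unlike an imputation step, it leaves any NaN values untouched.

```python
import numpy as np

# Hypothetical matrix with one -inf and one +inf.
matrix = np.array([[1.0, -np.inf],
                   [np.inf, 5.0],
                   [3.0, 7.0]])

# Hide every non-finite entry, then take per-column min and max of the rest.
finite_only = np.where(np.isfinite(matrix), matrix, np.nan)
col_min = np.nanmin(finite_only, axis=0)
col_max = np.nanmax(finite_only, axis=0)

# Broadcasting applies each column's min or max only where the mask is True.
result = np.where(matrix == -np.inf, col_min, matrix)
result = np.where(result == np.inf, col_max, result)
print(result)
```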
Answer 12
Try
mat.sum()
If the sum of your data is not finite, for example because it exceeds the largest representable float (about 3.402823e+38 for float32, about 1.797693e+308 for float64), the quick check is skipped and every element is tested; the error is raised if any element is NaN or infinite.
See the _assert_all_finite function in validation.py from the scikit-learn source code:
if is_float and np.isfinite(X.sum()):
    pass
elif is_float:
    msg_err = "Input contains {} or a value too large for {!r}."
    if (allow_nan and np.isinf(X).any() or
            not allow_nan and not np.isfinite(X).all()):
        type_err = 'infinity' if allow_nan else 'NaN, infinity'
        # print(X.sum())
        raise ValueError(msg_err.format(type_err, X.dtype))
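The two stages of that check can be tried by hand. This is a minimal sketch using only NumPy, with invented values. It shows a sum that overflows although every element is finite, and a float64 value that turns into infinity when cast to float32. Whether a particular estimator performs such a cast is an assumption here, not something the quoted source shows.

```python
import numpy as np

# Every element is finite, but the float64 sum overflows to infinity.
big = np.array([1e308, 1e308])
with np.errstate(over="ignore"):
    total = big.sum()

sum_is_finite = bool(np.isfinite(total))         # quick check fails
elements_finite = bool(np.isfinite(big).all())   # element-wise check passes

# Largest representable values for the two common float types.
f32_max = float(np.finfo(np.float32).max)
f64_max = float(np.finfo(np.float64).max)

# A float64 value above the float32 maximum becomes inf when cast down.
with np.errstate(over="ignore"):
    cast = np.array([1e39]).astype(np.float32)

print(sum_is_finite, elements_finite, f32_max, f64_max, cast)
```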