Tag archive: numpy

ImportError when importing from sklearn: cannot import name check_build

Question: ImportError when importing from sklearn: cannot import name check_build

I am getting the following error while trying to import from sklearn:

>>> from sklearn import svm

Traceback (most recent call last):
  File "<pyshell#17>", line 1, in <module>
   from sklearn import svm
  File "C:\Python27\lib\site-packages\sklearn\__init__.py", line 16, in <module>
   from . import check_build
ImportError: cannot import name check_build

I am using Python 2.7, scipy-0.12.0b1 superpack, numpy-1.6.0 superpack, and scikit-learn-0.11 on a Windows 7 machine.

I have checked several answers for this issue but none of them gives a way out of this error.


Answer 0

Worked for me after installing scipy.


Answer 1

>>> from sklearn import preprocessing, metrics, cross_validation

Traceback (most recent call last):
  File "<pyshell#6>", line 1, in <module>
    from sklearn import preprocessing, metrics, cross_validation
  File "D:\Python27\lib\site-packages\sklearn\__init__.py", line 31, in <module>
    from . import __check_build
ImportError: cannot import name __check_build
>>> ================================ RESTART ================================
>>> from sklearn import preprocessing, metrics, cross_validation
>>> 

So, simply try to restart the shell!


Answer 2

My solution for Python 3.6.5 64-bit Windows 10:

  1. pip uninstall sklearn
  2. pip uninstall scikit-learn
  3. pip install sklearn

No need to restart the command line, but you can if you want. It took me one day to fix this bug. Hope this helps.


Answer 3

After installing numpy and scipy, sklearn still had the error.

Solution:

Setting Up System Path Variable for Python & the PYTHONPATH Environment Variable

System Variables: add C:\Python34 to Path
User Variables: add new: (name) PYTHONPATH (value) C:\Python34\Lib\site-packages;
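A quick way to verify that the interpreter actually picks those directories up (a minimal check; the exact entries will vary by system):

import sys

# site-packages should appear somewhere in this list
print(sys.path)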


Answer 4

Usually when I get these kinds of errors, opening the __init__.py file and poking around helps. Go to the directory C:\Python27\lib\site-packages\sklearn and ensure that there’s a sub-directory called __check_build as a first step. On my machine (with a working sklearn installation, Mac OSX, Python 2.7.3) I have __init__.py, setup.py, their associated .pyc files, and a binary _check_build.so.

Poking around the __init__.py in that directory, the next step I’d take is to go to sklearn/__init__.py and comment out the import statement; the check_build stuff just checks that things were compiled correctly, and it doesn’t appear to do anything but call a precompiled binary. This is, of course, at your own risk, and is (to be sure) a workaround. If your build failed, you’ll likely soon run into other, bigger problems.
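As a quick sanity check before commenting anything out, a short snippet along these lines can confirm whether the __check_build subpackage is present (the path is taken from the traceback above; adjust it for your system):

import os

# Path from the traceback above; adjust for your install
pkg_dir = r"C:\Python27\lib\site-packages\sklearn"
print(os.path.isdir(os.path.join(pkg_dir, "__check_build")))
print(sorted(os.listdir(pkg_dir))[:10])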


Answer 5

I had the same issue on Windows. Solved it by installing Numpy+MKL from http://www.lfd.uci.edu/~gohlke/pythonlibs/#numpy (it is recommended there to install numpy+mkl before other packages that depend on it), as suggested by this answer.


Answer 6

I had problems importing sklearn after installing a new 64-bit version of Python 3.4 from python.org.

It turned out that it was the scipy module that was broken, which also failed when I tried to “import scipy”.

The solution was to uninstall scipy and reinstall it with pip3:

C:\> pip uninstall scipy

[lots of reporting messages deleted]

Proceed (y/n)? y
  Successfully uninstalled scipy-1.0.0

C:\Users\>pip3 install scipy

Collecting scipy
  Downloading scipy-1.0.0-cp36-none-win_amd64.whl (30.8MB)
    100% |████████████████████████████████| 30.8MB 33kB/s
Requirement already satisfied: numpy>=1.8.2 in c:\users\johnmccurdy\appdata\loca
l\programs\python\python36\lib\site-packages (from scipy)
Installing collected packages: scipy
Successfully installed scipy-1.0.0

C:\Users>python
Python 3.6.4 (v3.6.4:d48eceb, Dec 19 2017, 06:54:40) [MSC v.1900 64 bit (AMD64)]
 on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import scipy
>>>
>>> import sklearn
>>>

Answer 7

If you use Anaconda 2.7 64 bit, try

conda upgrade scikit-learn

and restart the Python shell; that worked for me.

Second edit, from when I faced the same problem again and solved it:

conda upgrade scikit-learn

also worked for me.


Answer 8

None of the other answers worked for me. After some tinkering, I uninstalled sklearn:

pip uninstall sklearn

Then I removed the sklearn folder from here (adjust the path to your system and Python version):

C:\Users\%USERNAME%\AppData\Roaming\Python\Python36\site-packages

And then installed it from a wheel from this site: link

The error was probably there because of a version conflict with an sklearn installed somewhere else.


Answer 9

For me, I was upgrading existing code to a new setup by installing Anaconda fresh with the latest Python version (3.7). For this, change

from sklearn import cross_validation, 
from sklearn.grid_search import GridSearchCV

to

from sklearn.model_selection import GridSearchCV,cross_validate

Answer 10

No need to uninstall and then re-install sklearn.

Try this:

from sklearn.model_selection import train_test_split

Answer 11

I had the same problem; reinstalling Anaconda solved the issue for me.


Answer 12

In Windows:

I tried to delete sklearn from the shell (pip uninstall sklearn) and re-install it, but it didn’t work.

The solution:

1- Open the cmd shell.
2- cd c:\pythonVERSION\scripts
3- pip uninstall sklearn
4- Open in Explorer: C:\pythonVERSION\Lib\site-packages
5- Look for the folders that contain sklearn and delete them.
6- Back in cmd: pip install sklearn

A column-vector y was passed when a 1d array was expected

Question: A column-vector y was passed when a 1d array was expected

I need to fit RandomForestRegressor from sklearn.ensemble.

forest = ensemble.RandomForestRegressor(**RF_tuned_parameters)
model = forest.fit(train_fold, train_y)
yhat = model.predict(test_fold)

This code always worked until I made some preprocessing of data (train_y). The error message says:

DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples,), for example using ravel().

model = forest.fit(train_fold, train_y)

Previously train_y was a Series; now it’s a numpy array (it is a column vector). If I apply train_y.ravel(), it becomes a row vector and no error message appears, though the prediction step takes a very long time (actually, it never finishes…).

In the docs of RandomForestRegressor I found that train_y should be defined as y : array-like, shape = [n_samples] or [n_samples, n_outputs]. Any idea how to solve this issue?


Answer 0

Change this line:

model = forest.fit(train_fold, train_y)

to:

model = forest.fit(train_fold, train_y.values.ravel())

Edit:

.values will give the values as an array of shape (n, 1).

.ravel will convert that array shape to (n,).
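A minimal sketch of what this does to the shapes, assuming train_y is a single-column pandas object like the one described in the question (the DataFrame here is made up for illustration):

import numpy as np
import pandas as pd

# A made-up single-column DataFrame standing in for train_y
train_y = pd.DataFrame({'target': [1.0, 2.0, 3.0]})

print(train_y.values.shape)          # (3, 1), a column vector
print(train_y.values.ravel().shape)  # (3,), the 1d shape sklearn expects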


Answer 1

I also encountered this situation when I was trying to train a KNN classifier, but it seems that the warning was gone after I changed

knn.fit(X_train, y_train)

to

knn.fit(X_train, np.ravel(y_train, order='C'))

Ahead of this line I used import numpy as np.


Answer 2

I had the same problem. The problem was that the labels were in a column format while it expected them in a row. Use np.ravel():

knn.score(training_set, np.ravel(training_labels))

Hope this solves it.


Answer 3

Use the code below:

model = forest.fit(train_fold, train_y.ravel())

If you are still getting slapped with an error identical to the one below:

Unknown label type: %r" % y

then use this code:

y = train_y.ravel()
train_y = np.array(y).astype(int)
model = forest.fit(train_fold, train_y)

Answer 4

Another way of doing this, instead of ravel, is to use reshape:

model = forest.fit(train_fold, train_y.values.reshape(-1,))

Answer 5

With neuraxle, you can easily solve this:

p = Pipeline([
   # expected outputs shape: (n, 1)
   OutputTransformerWrapper(NumpyRavel()), 
   # expected outputs shape: (n, )
   RandomForestRegressor(**RF_tuned_parameters)
])

p, outputs = p.fit_transform(data_inputs, expected_outputs)

Neuraxle is a sklearn-like framework for hyperparameter tuning and AutoML in deep learning projects!


Answer 6

# Flatten the column-vector train_y into a plain Python list of scalars:
format_train_y = []
for n in train_y:
    format_train_y.append(n[0])

Answer 7

Y = y.values[:, 0]

Here Y is the formatted train_y and y is the original train_y.


Finding local maxima/minima with Numpy in a 1D numpy array

Question: Finding local maxima/minima with Numpy in a 1D numpy array

Can you suggest a module function from numpy/scipy that can find local maxima/minima in a 1D numpy array? Obviously the simplest approach ever is to have a look at the nearest neighbours, but I would like to have an accepted solution that is part of the numpy distro.


Answer 0

If you are looking for all entries in the 1d array a that are smaller than their neighbors, you can try

numpy.r_[True, a[1:] < a[:-1]] & numpy.r_[a[:-1] < a[1:], True]

You could also smooth your array before this step using numpy.convolve().

I don’t think there is a dedicated function for this.
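A small worked example of the expression above, finding the local minima of a toy array:

import numpy as np

a = np.array([3, 2, 5, 1, 4, 4])

# True where an entry is smaller than both neighbors
# (each endpoint is compared against its single neighbor only)
minima = np.r_[True, a[1:] < a[:-1]] & np.r_[a[:-1] < a[1:], True]
print(minima)               # [False  True False  True False False]
print(np.where(minima)[0])  # [1 3], the indices of the local minima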


Answer 1

In SciPy >= 0.11

import numpy as np
from scipy.signal import argrelextrema

x = np.random.random(12)

# for local maxima
argrelextrema(x, np.greater)

# for local minima
argrelextrema(x, np.less)

Produces

>>> x
array([ 0.56660112,  0.76309473,  0.69597908,  0.38260156,  0.24346445,
    0.56021785,  0.24109326,  0.41884061,  0.35461957,  0.54398472,
    0.59572658,  0.92377974])
>>> argrelextrema(x, np.greater)
(array([1, 5, 7]),)
>>> argrelextrema(x, np.less)
(array([4, 6, 8]),)

Note, these are the indices of x that are local max/min. To get the values, try:

>>> x[argrelextrema(x, np.greater)[0]]

scipy.signal also provides argrelmax and argrelmin for finding maxima and minima respectively.


Answer 2

For curves with not too much noise, I recommend the following small code snippet:

from numpy import *

# example data with some peaks:
x = linspace(0, 4, 1000)  # note: the sample count must be an integer
data = .2*sin(10*x)+ exp(-abs(2-x)**2)

# these are the lines you need:
a = diff(sign(diff(data))).nonzero()[0] + 1 # local min+max
b = (diff(sign(diff(data))) > 0).nonzero()[0] + 1 # local min
c = (diff(sign(diff(data))) < 0).nonzero()[0] + 1 # local max


# graphical output...
from pylab import *
plot(x,data)
plot(x[b], data[b], "o", label="min")
plot(x[c], data[c], "o", label="max")
legend()
show()

The +1 is important, because diff reduces the original index number.


Answer 3

Another approach (more words, less code) that may help:

The locations of local maxima and minima are also the locations of the zero crossings of the first derivative. It is generally much easier to find zero crossings than it is to directly find local maxima and minima.

Unfortunately, the first derivative tends to “amplify” noise, so when significant noise is present in the original data, the first derivative is best used only after the original data has had some degree of smoothing applied.

Since smoothing is, in the simplest sense, a low pass filter, the smoothing is often best (well, most easily) done by using a convolution kernel, and “shaping” that kernel can provide a surprising amount of feature-preserving/enhancing capability. The process of finding an optimal kernel can be automated using a variety of means, but the best may be simple brute force (plenty fast for finding small kernels). A good kernel will (as intended) massively distort the original data, but it will NOT affect the location of the peaks/valleys of interest.

Fortunately, quite often a suitable kernel can be created via a simple SWAG (“educated guess”). The width of the smoothing kernel should be a little wider than the widest expected “interesting” peak in the original data, and its shape will resemble that peak (a single-scaled wavelet). For mean-preserving kernels (what any good smoothing filter should be) the sum of the kernel elements should be precisely equal to 1.00, and the kernel should be symmetric about its center (meaning it will have an odd number of elements).

Given an optimal smoothing kernel (or a small number of kernels optimized for different data content), the degree of smoothing becomes a scaling factor for (the “gain” of) the convolution kernel.

Determining the “correct” (optimal) degree of smoothing (convolution kernel gain) can even be automated: Compare the standard deviation of the first derivative data with the standard deviation of the smoothed data. How the ratio of the two standard deviations changes with changes in the degree of smoothing can be used to predict effective smoothing values. A few manual data runs (that are truly representative) should be all that’s needed.

All the prior solutions posted above compute the first derivative, but they don’t treat it as a statistical measure, nor do they attempt to perform feature preserving/enhancing smoothing (to help subtle peaks “leap above” the noise).

Finally, the bad news: Finding “real” peaks becomes a royal pain when the noise also has features that look like real peaks (overlapping bandwidth). The next more-complex solution is generally to use a longer convolution kernel (a “wider kernel aperture”) that takes into account the relationship between adjacent “real” peaks (such as minimum or maximum rates for peak occurrence), or to use multiple convolution passes using kernels having different widths (but only if it is faster: it is a fundamental mathematical truth that linear convolutions performed in sequence can always be convolved together into a single convolution). But it is often far easier to first find a sequence of useful kernels (of varying widths) and convolve them together than it is to directly find the final kernel in a single step.

Hopefully this provides enough info to let Google (and perhaps a good stats text) fill in the gaps. I really wish I had the time to provide a worked example, or a link to one. If anyone comes across one online, please post it here!
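In that spirit, here is a minimal, hedged sketch of the idea (all choices here, such as the kernel width, the noise level, and the prominence threshold, are illustrative assumptions rather than tuned values):

import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 4, 1000)
noisy = np.exp(-(x - 2)**2) + 0.05 * rng.standard_normal(x.size)

# Mean-preserving smoothing kernel: symmetric, odd length, sums to 1.00
kernel = np.exp(-np.linspace(-2, 2, 21)**2)
kernel /= kernel.sum()
smoothed = np.convolve(noisy, kernel, mode='same')

# Local maxima are downward zero crossings of the first derivative
d = np.diff(smoothed)
crossings = np.where((d[:-1] > 0) & (d[1:] <= 0))[0] + 1

# Keep only prominent maxima; noise can add small spurious crossings
peaks = crossings[smoothed[crossings] > 0.5]
print(x[peaks])  # near 2.0, the location of the true peak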


Answer 4

As of SciPy version 1.1, you can also use find_peaks. Below are two examples taken from the documentation itself.

Using the height argument, one can select all maxima above a certain threshold (in this example, all non-negative maxima; this can be very useful if one has to deal with a noisy baseline; if you want to find minima, just multiply your input by -1):

import matplotlib.pyplot as plt
from scipy.misc import electrocardiogram
from scipy.signal import find_peaks
import numpy as np

x = electrocardiogram()[2000:4000]
peaks, _ = find_peaks(x, height=0)
plt.plot(x)
plt.plot(peaks, x[peaks], "x")
plt.plot(np.zeros_like(x), "--", color="gray")
plt.show()

[figure: ECG excerpt with all non-negative maxima marked]

Another extremely helpful argument is distance, which defines the minimum distance between two peaks:

peaks, _ = find_peaks(x, distance=150)
# difference between peaks is >= 150
print(np.diff(peaks))
# prints [186 180 177 171 177 169 167 164 158 162 172]

plt.plot(x)
plt.plot(peaks, x[peaks], "x")
plt.show()

[figure: ECG excerpt with peaks at least 150 samples apart marked]


Answer 5

Why not use the Scipy built-in function signal.find_peaks_cwt to do the job?

from scipy import signal
import numpy as np

#generate junk data (numpy 1D arr)
xs = np.arange(0, np.pi, 0.05)
data = np.sin(xs)

# maxima : use builtin function to find (max) peaks
max_peakind = signal.find_peaks_cwt(data, np.arange(1,10))

# inverse  (in order to find minima)
inv_data = 1/data
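# caution: data[0] == sin(0) == 0 here, so 1/data divides by zero at the
# first sample; negating the data (finding maxima of -data) is a safer inversion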
# minima : use builtin function to find (min) peaks (use inverted data)
min_peakind = signal.find_peaks_cwt(inv_data, np.arange(1,10))

#show results
print "maxima",  data[max_peakind]
print "minima",  data[min_peakind]

results:

maxima [ 0.9995736]
minima [ 0.09146464]

Regards


Answer 6

Update: I wasn’t happy with gradient, so I found it more reliable to use numpy.diff. Please let me know if it does what you want.

Regarding the issue of noise: the mathematical problem is to locate maxima/minima; if we want to handle noise, we can use something like convolve, which was mentioned earlier.

import numpy as np
from matplotlib import pyplot

a = np.array([10.3,2,0.9,4,5,6,7,34,2,5,25,3,-26,-20,-29], dtype=np.float)

gradients = np.diff(a)
print gradients


maxima_num = 0
minima_num = 0
max_locations = []
min_locations = []
count = 0
for i in gradients[:-1]:
    count += 1

    if ((cmp(i, 0) > 0) & (cmp(gradients[count], 0) < 0) & (i != gradients[count])):
        maxima_num += 1
        max_locations.append(count)

    if ((cmp(i, 0) < 0) & (cmp(gradients[count], 0) > 0) & (i != gradients[count])):
        minima_num += 1
        min_locations.append(count)


turning_points = {'maxima_number': maxima_num, 'minima_number': minima_num,
                  'maxima_locations': max_locations, 'minima_locations': min_locations}

print turning_points

pyplot.plot(a)
pyplot.show()

Answer 7

While this question is really old, I believe there is a much simpler approach in numpy (a one-liner).

import numpy as np

list = [1,3,9,5,2,5,6,9,7]

np.diff(np.sign(np.diff(list))) #the one liner

#output
array([ 0, -2,  0,  2,  0,  0, -2])

To find a local max or min we essentially want to find when the difference between the values in the list (3-1, 9-3…) changes from positive to negative (max) or negative to positive (min). Therefore, first we find the difference. Then we find the sign, and then we find the changes in sign by taking the difference again. (Sort of like a first and second derivative in calculus, only we have discrete data and don’t have a continuous function.)

The output in my example does not contain the extrema (the first and last values in the list). Also, just like in calculus, if the second derivative is negative you have a max, and if it is positive you have a min.

Thus we have the following matchup:

[1,  3,  9,  5,  2,  5,  6,  9,  7]
    [0, -2,  0,  2,  0,  0, -2]
        Max     Min         Max
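To turn that one-liner into actual index positions, a short follow-up sketch (element i of the double-diff output corresponds to element i+1 of the original list, hence the +1):

import numpy as np

a = np.array([1, 3, 9, 5, 2, 5, 6, 9, 7])
second_diff = np.diff(np.sign(np.diff(a)))

local_max = np.where(second_diff < 0)[0] + 1  # [2 7], values 9 and 9
local_min = np.where(second_diff > 0)[0] + 1  # [4], value 2
print(local_max)
print(local_min)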

Answer 8

None of these solutions worked for me, since I wanted to find peaks in the center of repeating values as well. For example, in

ar = np.array([0,1,2,2,2,1,3,3,3,2,5,0])

the answer should be

array([ 3,  7, 10], dtype=int64)

I did this using a loop. I know it’s not super clean, but it gets the job done.

def findLocalMaxima(ar):
    # find local maxima of array, including centers of repeating elements
    maxInd = np.zeros_like(ar)
    peakVar = -np.inf
    i = -1
    while i < len(ar) - 1:
        i += 1
        if peakVar < ar[i]:
            peakVar = ar[i]
            for j in range(i, len(ar)):
                if peakVar < ar[j]:
                    break
                elif peakVar == ar[j]:
                    continue
                elif peakVar > ar[j]:
                    # center of the plateau between i and j
                    peakInd = i + np.floor(abs(i - j) / 2)
                    maxInd[peakInd.astype(int)] = 1
                    i = j
                    break
        peakVar = ar[i]
    maxInd = np.where(maxInd)[0]
    return maxInd

Answer 9

import numpy as np
x = np.array([6,3,5,2,1,4,9,7,8])
y = np.array([2,1,3,5,3,9,8,10,7])
sortId = np.argsort(x)
x = x[sortId]
y = y[sortId]
length = len(y)  # note: this definition was missing from the original answer
minm = np.array([])
maxm = np.array([])
i = 0
while i < length-1:
    if i < length - 1:
        while i < length-1 and y[i+1] >= y[i]:
            i += 1

        if i != 0 and i < length-1:
            maxm = np.append(maxm, i)

        i += 1

    if i < length - 1:
        while i < length-1 and y[i+1] <= y[i]:
            i += 1

        if i < length-1:
            minm = np.append(minm, i)
        i += 1


print minm
print maxm

minm and maxm contain the indices of minima and maxima, respectively. For a huge data set this will report lots of maxima/minima, so in that case smooth the curve first and then apply this algorithm.


Answer 10

Another solution using essentially a dilate operator:

import numpy as np
from scipy.ndimage import rank_filter

def find_local_maxima(x):
    x_dilate = rank_filter(x, -1, size=3)
    return x_dilate == x

and for the minima:

def find_local_minima(x):
    x_erode = rank_filter(x, 0, size=3)
    return x_erode == x

Also, from scipy.ndimage you can replace rank_filter(x, -1, size=3) with grey_dilation and rank_filter(x, 0, size=3) with grey_erosion. This won’t require a local sort, so it is slightly faster.
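The grey_dilation variant mentioned above might look like this (a sketch; the same 3-wide neighborhood is assumed):

import numpy as np
from scipy.ndimage import grey_dilation

def find_local_maxima_dilation(x):
    # a point that equals the maximum of its 3-neighborhood is a local max
    return grey_dilation(x, size=3) == x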


Answer 11

Another one:


def local_maxima_mask(vec):
    """
    Get a mask of all points in vec which are local maxima
    :param vec: A real-valued vector
    :return: A boolean mask of the same size where True elements correspond to maxima. 
    """
    mask = np.zeros(vec.shape, dtype=bool)
    greater_than_the_last = np.diff(vec) > 0  # length N-1
    mask[1:] = greater_than_the_last
    mask[:-1] &= ~greater_than_the_last
    return mask
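A quick usage check of the mask function above (the input values are arbitrary):

import numpy as np

vec = np.array([0.2, 1.0, 0.4, 0.9, 0.1])
mask = local_maxima_mask(vec)
print(np.where(mask)[0])  # [1 3], the positions of 1.0 and 0.9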

How to convert an array of strings to an array of floats in numpy?

Question: How to convert an array of strings to an array of floats in numpy?

How to convert

["1.1", "2.2", "3.2"]

to

[1.1, 2.2, 3.2]

in NumPy?


Answer 0

Well, if you’re reading the data in as a list, just do np.array(map(float, list_of_strings)) (or equivalently, use a list comprehension). (In Python 3, you’ll need to call list on the map return value if you use map, since map returns an iterator now.)

However, if it’s already a numpy array of strings, there’s a better way. Use astype().

import numpy as np
x = np.array(['1.1', '2.2', '3.3'])
y = x.astype(np.float)
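For reference, a small sketch of the Python 3 variant mentioned above:

import numpy as np

list_of_strings = ["1.1", "2.2", "3.2"]

# On Python 3, map() returns an iterator, so wrap it in list()
# (or use a list comprehension) before handing it to np.array
arr = np.array(list(map(float, list_of_strings)))
print(arr)        # [1.1 2.2 3.2]
print(arr.dtype)  # float64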

Answer 1

You can use this as well

import numpy as np
x=np.array(['1.1', '2.2', '3.3'])
x=np.asfarray(x,float)

Answer 2

Another option might be numpy.asarray:

import numpy as np
a = ["1.1", "2.2", "3.2"]
b = np.asarray(a, dtype=np.float64, order='C')

For Python 2:

print a, type(a), type(a[0])
print b, type(b), type(b[0])

resulting in:

['1.1', '2.2', '3.2'] <type 'list'> <type 'str'>
[1.1 2.2 3.2] <type 'numpy.ndarray'> <type 'numpy.float64'>

Answer 3

If you have (or create) a single string, you can use np.fromstring:

import numpy as np
x = ["1.1", "2.2", "3.2"]
x = ','.join(x)
x = np.fromstring( x, dtype=np.float, sep=',' )

Note, x = ','.join(x) transforms the x list into the string '1.1,2.2,3.2'. If you read a line from a txt file, each line will already be a string.


Unable to allocate array with shape and data type

Question: Unable to allocate array with shape and data type

I’m facing an issue with allocating huge arrays in numpy on Ubuntu 18 while not facing the same issue on MacOS.

I am trying to allocate memory for a numpy array with shape (156816, 36, 53806) with

np.zeros((156816, 36, 53806), dtype='uint8')

and I get the following error on Ubuntu:

>>> import numpy as np
>>> np.zeros((156816, 36, 53806), dtype='uint8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
numpy.core._exceptions.MemoryError: Unable to allocate array with shape (156816, 36, 53806) and data type uint8

I’m not getting it on MacOS:

>>> import numpy as np 
>>> np.zeros((156816, 36, 53806), dtype='uint8')
array([[[0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        ...,
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0]],

       [[0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        ...,
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0]],

       [[0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        ...,
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0]],

       ...,

       [[0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        ...,
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0]],

       [[0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        ...,
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0]],

       [[0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        ...,
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0]]], dtype=uint8)

I’ve read somewhere that np.zeros shouldn’t really allocate the whole memory needed for the array, but only for the non-zero elements. And the Ubuntu machine has 64 GB of memory, while my MacBook Pro has only 16 GB.

versions:

Ubuntu
os -> ubuntu mate 18
python -> 3.6.8
numpy -> 1.17.0

mac
os -> 10.14.6
python -> 3.6.4
numpy -> 1.17.0

PS: also failed on Google Colab


Answer 0

This is likely due to your system’s overcommit handling mode.

In the default mode, 0,

Heuristic overcommit handling. Obvious overcommits of address space are refused. Used for a typical system. It ensures a seriously wild allocation fails while allowing overcommit to reduce swap usage. root is allowed to allocate slightly more memory in this mode. This is the default.

The exact heuristic used is not well explained here, but this is discussed more on Linux over commit heuristic and on this page.

You can check your current overcommit mode by running

$ cat /proc/sys/vm/overcommit_memory
0

In this case you’re allocating

>>> 156816 * 36 * 53806 / 1024.0**3
282.8939827680588

~282 GB, and the kernel is saying well obviously there’s no way I’m going to be able to commit that many physical pages to this, and it refuses the allocation.

If (as root) you run:

$ echo 1 > /proc/sys/vm/overcommit_memory

This will enable “always overcommit” mode, and you’ll find that indeed the system will allow you to make the allocation no matter how large it is (within 64-bit memory addressing at least).

I tested this myself on a machine with 32 GB of RAM. With overcommit mode 0 I also got a MemoryError, but after changing it back to 1 it works:

>>> import numpy as np
>>> a = np.zeros((156816, 36, 53806), dtype='uint8')
>>> a.nbytes
303755101056

You can then go ahead and write to any location within the array, and the system will only allocate physical pages when you explicitly write to that page. So you can use this, with care, for sparse arrays.
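Note that echoing into /proc only changes the setting until the next reboot. To make it persistent, the usual approach (assuming a standard sysctl setup) is something like:

$ sudo sysctl vm.overcommit_memory=1                               # apply now
$ echo 'vm.overcommit_memory = 1' | sudo tee -a /etc/sysctl.conf   # persist after reboot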


Answer 1

I had this same problem on Windows and came across this solution. So if someone comes across this problem on Windows: the solution for me was to increase the pagefile size, as it was a memory overcommitment problem for me too.

Windows 8

  1. On the Keyboard Press the WindowsKey + X then click System in the popup menu
  2. Tap or click Advanced system settings. You might be asked for an admin password or to confirm your choice
  3. On the Advanced tab, under Performance, tap or click Settings.
  4. Tap or click the Advanced tab, and then, under Virtual memory, tap or click Change
  5. Clear the Automatically manage paging file size for all drives check box.
  6. Under Drive [Volume Label], tap or click the drive that contains the paging file you want to change
  7. Tap or click Custom size, enter a new size in megabytes in the initial size (MB) or Maximum size (MB) box, tap or click Set, and then tap or click OK
  8. Reboot your system

Windows 10

  1. Press the Windows key
  2. Type SystemPropertiesAdvanced
  3. Click Run as administrator
  4. Under Performance, click Settings
  5. Select the Advanced tab
  6. Select Change…
  7. Uncheck Automatically managing paging file size for all drives
  8. Then select Custom size and fill in the appropriate size
  9. Press Set then press OK then exit from the Virtual Memory, Performance Options, and System Properties Dialog
  10. Reboot your system

Note: I did not have enough memory on my system for the ~282 GB in this example, but for my particular case this worked.

EDIT

From here the suggested recommendations for page file size:

There is a formula for calculating the correct pagefile size. Initial size is one and a half (1.5) x the amount of total system memory. Maximum size is three (3) x the initial size. So let’s say you have 4 GB (1 GB = 1,024 MB x 4 = 4,096 MB) of memory. The initial size would be 1.5 x 4,096 = 6,144 MB and the maximum size would be 3 x 6,144 = 18,432 MB.

Some things to keep in mind from here:

However, this does not take into consideration other important factors and system settings that may be unique to your computer. Again, let Windows choose what to use instead of relying on some arbitrary formula that worked on a different computer.

Also:

Increasing page file size may help prevent instabilities and crashing in Windows. However, hard drive read/write times are much slower than they would be if the data were in your computer’s memory. Having a larger page file will add extra work for your hard drive, causing everything else to run slower. Page file size should only be increased when encountering out-of-memory errors, and only as a temporary fix. A better solution is to add more memory to the computer.


Answer 2

I came across this problem on Windows too. The solution for me was to switch from a 32-bit to a 64-bit version of Python. Indeed, 32-bit software, like a 32-bit CPU, can address a maximum of 4 GB of RAM (2^32). So if you have more than 4 GB of RAM, a 32-bit version cannot take advantage of it.

With a 64-bit version of Python (the one labeled x86-64 in the download page), the issue disappeared.

You can check which version you have by entering the interpreter. I, with a 64-bit version, now have: Python 3.7.5rc1 (tags/v3.7.5rc1:4082f600a5, Oct 1 2019, 20:28:14) [MSC v.1916 64 bit (AMD64)], where [MSC v.1916 64 bit (AMD64)] means “64-bit Python”.
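A quick way to check the bitness from code (this works on both Python 2 and 3):

import struct
import sys

# 8 bytes per pointer means a 64-bit interpreter, 4 bytes means 32-bit
print(struct.calcsize('P') * 8)
print(sys.maxsize > 2**32)  # True on a 64-bit build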

Note: as of the time of this writing (May 2020), matplotlib is not available for Python 3.9, so I recommend installing Python 3.7, 64-bit.



Answer 3

In my case, adding a dtype attribute changed the dtype of the array to a smaller type (from float64 to uint8), decreasing the array size enough to not throw a MemoryError on Windows (64-bit).

from

mask = np.zeros(edges.shape)

to

mask = np.zeros(edges.shape,dtype='uint8')

Answer 4

Sometimes this error pops up because the kernel has reached its limit. Try restarting the kernel and redoing the necessary steps.


Answer 5

Changing the data type to one that uses less memory works. For me, I changed the data type to numpy.uint8:

data['label'] = data['label'].astype(np.uint8)
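A small sketch of how much this can save (the array size here, 10 million elements, is arbitrary):

import numpy as np

a = np.zeros(10000000)       # float64 by default, 8 bytes per element
b = a.astype(np.uint8)       # 1 byte per element

print(a.nbytes / 1024.0**2)  # ~76.3 MB
print(b.nbytes / 1024.0**2)  # ~9.5 MB, 8x smaller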

How to write a multidimensional array to a text file?

Question: How to write a multidimensional array to a text file?

In another question, other users offered some help if I could supply the array I was having trouble with. However, I even fail at a basic I/O task, such as writing an array to a file.

Can anyone explain what kind of loop I would need to write a 4x11x14 numpy array to file?

This array consists of four 11 x 14 arrays, so I should format it with nice newlines to make the file easier for others to read.

Edit: So I’ve tried the numpy.savetxt function. Strangely, it gives the following error:

TypeError: float argument required, not numpy.ndarray

I assume that this is because the function doesn’t work with multidimensional arrays? Any solutions as I would like them within one file?


Answer 0

If you want to write it to disk so that it will be easy to read back in as a numpy array, look into numpy.save. Pickling it will work fine, as well, but it’s less efficient for large arrays (which yours isn’t, so either is perfectly fine).

If you want it to be human readable, look into numpy.savetxt.
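For completeness, a minimal np.save / np.load round trip (binary .npy, not human-readable) might look like this:

import numpy as np

data = np.arange(200).reshape((4, 5, 10))

# The .npy file stores shape and dtype, so no reshape is needed on load
np.save('data.npy', data)
restored = np.load('data.npy')
assert np.array_equal(restored, data)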

Edit: So, it seems like savetxt isn’t quite as great an option for arrays with more than 2 dimensions… But just to draw everything out to its full conclusion:

I just realized that numpy.savetxt chokes on ndarrays with more than 2 dimensions… This is probably by design, as there’s no inherently defined way to indicate additional dimensions in a text file.

E.g. This (a 2D array) works fine

import numpy as np
x = np.arange(20).reshape((4,5))
np.savetxt('test.txt', x)

While the same thing would fail (with a rather uninformative error: TypeError: float argument required, not numpy.ndarray) for a 3D array:

import numpy as np
x = np.arange(200).reshape((4,5,10))
np.savetxt('test.txt', x)

One workaround is just to break the 3D (or greater) array into 2D slices. E.g.

x = np.arange(200).reshape((4,5,10))
with open('test.txt', 'w') as outfile:
    for slice_2d in x:
        np.savetxt(outfile, slice_2d)

However, our goal is to be clearly human readable, while still being easily read back in with numpy.loadtxt. Therefore, we can be a bit more verbose, and differentiate the slices using commented out lines. By default, numpy.loadtxt will ignore any lines that start with # (or whichever character is specified by the comments kwarg). (This looks more verbose than it actually is…)

import numpy as np

# Generate some test data
data = np.arange(200).reshape((4,5,10))

# Write the array to disk
with open('test.txt', 'w') as outfile:
    # I'm writing a header here just for the sake of readability
    # Any line starting with "#" will be ignored by numpy.loadtxt
    outfile.write('# Array shape: {0}\n'.format(data.shape))
    
    # Iterating through a ndimensional array produces slices along
    # the last axis. This is equivalent to data[i,:,:] in this case
    for data_slice in data:

        # The formatting string indicates that I'm writing out
        # the values in left-justified columns 7 characters in width
        # with 2 decimal places.  
        np.savetxt(outfile, data_slice, fmt='%-7.2f')

        # Writing out a break to indicate different slices...
        outfile.write('# New slice\n')

This yields:

# Array shape: (4, 5, 10)
0.00    1.00    2.00    3.00    4.00    5.00    6.00    7.00    8.00    9.00   
10.00   11.00   12.00   13.00   14.00   15.00   16.00   17.00   18.00   19.00  
20.00   21.00   22.00   23.00   24.00   25.00   26.00   27.00   28.00   29.00  
30.00   31.00   32.00   33.00   34.00   35.00   36.00   37.00   38.00   39.00  
40.00   41.00   42.00   43.00   44.00   45.00   46.00   47.00   48.00   49.00  
# New slice
50.00   51.00   52.00   53.00   54.00   55.00   56.00   57.00   58.00   59.00  
60.00   61.00   62.00   63.00   64.00   65.00   66.00   67.00   68.00   69.00  
70.00   71.00   72.00   73.00   74.00   75.00   76.00   77.00   78.00   79.00  
80.00   81.00   82.00   83.00   84.00   85.00   86.00   87.00   88.00   89.00  
90.00   91.00   92.00   93.00   94.00   95.00   96.00   97.00   98.00   99.00  
# New slice
100.00  101.00  102.00  103.00  104.00  105.00  106.00  107.00  108.00  109.00 
110.00  111.00  112.00  113.00  114.00  115.00  116.00  117.00  118.00  119.00 
120.00  121.00  122.00  123.00  124.00  125.00  126.00  127.00  128.00  129.00 
130.00  131.00  132.00  133.00  134.00  135.00  136.00  137.00  138.00  139.00 
140.00  141.00  142.00  143.00  144.00  145.00  146.00  147.00  148.00  149.00 
# New slice
150.00  151.00  152.00  153.00  154.00  155.00  156.00  157.00  158.00  159.00 
160.00  161.00  162.00  163.00  164.00  165.00  166.00  167.00  168.00  169.00 
170.00  171.00  172.00  173.00  174.00  175.00  176.00  177.00  178.00  179.00 
180.00  181.00  182.00  183.00  184.00  185.00  186.00  187.00  188.00  189.00 
190.00  191.00  192.00  193.00  194.00  195.00  196.00  197.00  198.00  199.00 
# New slice

Reading it back in is very easy, as long as we know the shape of the original array. We can just do numpy.loadtxt('test.txt').reshape((4,5,10)). As an example (you can do this in one line; I’m just being verbose to clarify things):

# Read the array from disk
new_data = np.loadtxt('test.txt')

# Note that this returned a 2D array!
print new_data.shape

# However, going back to 3D is easy if we know the 
# original shape of the array
new_data = new_data.reshape((4,5,10))
    
# Just to check that they're the same...
assert np.all(new_data == data)
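
As a refinement, the header that the writing snippet above puts in the first line already records the shape, so you can recover it at read time instead of hard-coding it. A minimal sketch, assuming the '# Array shape: (4, 5, 10)' header format used above:

import ast
import numpy as np

# Parse the shape tuple back out of the first header line
with open('test.txt') as infile:
    header = infile.readline()           # "# Array shape: (4, 5, 10)"
shape = ast.literal_eval(header.split(':', 1)[1].strip())

# loadtxt still skips all the '#' comment lines; reshape with the recovered shape
new_data = np.loadtxt('test.txt').reshape(shape)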

回答 1

鉴于我认为您希望文件对人可读,我不确定这是否满足您的要求;但如果这不是主要诉求,直接用 pickle 就可以了。

要保存它:

import pickle

my_data = {'a': [1, 2.0, 3, 4+6j],
           'b': ('string', u'Unicode string'),
           'c': None}
output = open('data.pkl', 'wb')
pickle.dump(my_data, output)
output.close()

读回:

import pprint, pickle

pkl_file = open('data.pkl', 'rb')

data1 = pickle.load(pkl_file)
pprint.pprint(data1)

pkl_file.close()

I am not certain if this meets your requirements, given I think you are interested in making the file readable by people, but if that’s not a primary concern, just pickle it.

To save it:

import pickle

my_data = {'a': [1, 2.0, 3, 4+6j],
           'b': ('string', u'Unicode string'),
           'c': None}
output = open('data.pkl', 'wb')
pickle.dump(my_data, output)
output.close()

To read it back:

import pprint, pickle

pkl_file = open('data.pkl', 'rb')

data1 = pickle.load(pkl_file)
pprint.pprint(data1)

pkl_file.close()
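
Since the question is about numpy arrays, note the same pattern applies to an ndarray directly; a minimal sketch using with blocks so the files are closed automatically:

import pickle

import numpy as np

arr = np.arange(200).reshape((4, 5, 10))

with open('arr.pkl', 'wb') as f:
    pickle.dump(arr, f)

with open('arr.pkl', 'rb') as f:
    arr2 = pickle.load(f)

# shape and dtype survive the round trip
assert np.array_equal(arr, arr2)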

回答 2

如果不需要人类可读的输出,另一种可以尝试的选择是将数组另存为 MATLAB 的 .mat 文件,它是一种结构化数组。我鄙视 MATLAB,但我能用很少几行代码读写 .mat 文件这一点确实很方便。

与 Joe Kington 的回答不同,这样做的好处是您不需要知道 .mat 文件中数据的原始形状,即读取时无需 reshape。而且,与使用 pickle 不同,.mat 文件可以被 MATLAB 读取,也可能被其他一些程序/语言读取。

这是一个例子:

import numpy as np
import scipy.io

# Some test data
x = np.arange(200).reshape((4,5,10))

# Specify the filename of the .mat file
matfile = 'test_mat.mat'

# Write the array to the mat file. For this to work, the array must be the value
# corresponding to a key name of your choice in a dictionary
scipy.io.savemat(matfile, mdict={'out': x}, oned_as='row')

# For the above line, I specified the kwarg oned_as since python (2.7 with 
# numpy 1.6.1) throws a FutureWarning.  Here, this isn't really necessary 
# since oned_as is a kwarg for dealing with 1-D arrays.

# Now load in the data from the .mat that was just saved
matdata = scipy.io.loadmat(matfile)

# And just to check if the data is the same:
assert np.all(x == matdata['out'])

如果忘记了数组在 .mat 文件中对应的键名,您随时可以执行以下操作:

print matdata.keys()

当然,您可以使用更多键存储许多数组。

因此,是的——它无法用肉眼直接阅读,但写入和读取数据各只需两行代码,我认为这是一个公平的权衡。

看看scipy.io.savematscipy.io.loadmat的文档, 以及本教程页面:scipy.io文件IO教程

If you don’t need a human-readable output, another option you could try is to save the array as a MATLAB .mat file, which is a structured array. I despise MATLAB, but the fact that I can both read and write a .mat in very few lines is convenient.

Unlike Joe Kington’s answer, the benefit of this is that you don’t need to know the original shape of the data in the .mat file, i.e. no need to reshape upon reading in. And, unlike using pickle, a .mat file can be read by MATLAB, and probably some other programs/languages as well.

Here is an example:

import numpy as np
import scipy.io

# Some test data
x = np.arange(200).reshape((4,5,10))

# Specify the filename of the .mat file
matfile = 'test_mat.mat'

# Write the array to the mat file. For this to work, the array must be the value
# corresponding to a key name of your choice in a dictionary
scipy.io.savemat(matfile, mdict={'out': x}, oned_as='row')

# For the above line, I specified the kwarg oned_as since python (2.7 with 
# numpy 1.6.1) throws a FutureWarning.  Here, this isn't really necessary 
# since oned_as is a kwarg for dealing with 1-D arrays.

# Now load in the data from the .mat that was just saved
matdata = scipy.io.loadmat(matfile)

# And just to check if the data is the same:
assert np.all(x == matdata['out'])

If you forget the key that the array is named in the .mat file, you can always do:

print matdata.keys()

And of course you can store many arrays using many more keys.

So yes – it won’t be readable with your eyes, but only takes 2 lines to write and read the data, which I think is a fair trade-off.

Take a look at the docs for scipy.io.savemat and scipy.io.loadmat and also this tutorial page: scipy.io File IO Tutorial


回答 3

ndarray.tofile() 应该也可以

例如,如果您的数组名为 a:

a.tofile('yourfile.txt',sep=" ",format="%s")

虽然不确定如何获取换行格式。

编辑在此处向Kevin J. Black发表评论):

从1.5.0版开始,np.tofile()采用可选参数 newline='\n'以允许多行输出。 https://docs.scipy.org/doc/numpy-1.13.0/reference/generated/numpy.savetxt.html

ndarray.tofile() should also work

e.g. if your array is called a:

a.tofile('yourfile.txt',sep=" ",format="%s")

Not sure how to get newline formatting though.

Edit (credit to Kevin J. Black’s comment here):

Since version 1.5.0, np.savetxt takes an optional parameter newline='\n' to allow multi-line output. https://docs.scipy.org/doc/numpy-1.13.0/reference/generated/numpy.savetxt.html
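
One caveat worth spelling out: tofile() stores neither the shape nor the dtype, so you must supply both yourself when reading the data back with np.fromfile. A minimal round-trip sketch using the same sep=" " text mode as above:

import numpy as np

a = np.arange(200).reshape((4, 5, 10))
a.tofile('yourfile.txt', sep=" ", format="%s")

# fromfile returns a flat 1-D array; dtype and shape must be re-supplied
b = np.fromfile('yourfile.txt', dtype=a.dtype, sep=" ").reshape(a.shape)
assert np.array_equal(a, b)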


回答 4

有专门的库可以做到这一点。(加上python包装器)

希望这可以帮助

There exist special libraries to do just that. (Plus wrappers for python)

hope this helps


回答 5

您可以简单地用三重嵌套循环遍历数组,并把值写入文件。读取时,使用完全相同的循环结构即可:您会按正确的顺序拿到这些值,从而再次正确地填充数组。

You can simply traverse the array in three nested loops and write their values to your file. For reading, you simply use the same exact loop construction. You will get the values in exactly the right order to fill your arrays correctly again.
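
A minimal sketch of that idea for a 3D array (the shape is assumed to be known at read time, matching the write loops):

import numpy as np

data = np.arange(24).reshape((2, 3, 4))

# write: one value per line; the loop order fixes the layout
with open('loops.txt', 'w') as f:
    for i in range(data.shape[0]):
        for j in range(data.shape[1]):
            for k in range(data.shape[2]):
                f.write('%d\n' % data[i, j, k])

# read: the exact same loop structure refills the array in the same order
restored = np.empty((2, 3, 4), dtype=int)
with open('loops.txt') as f:
    for i in range(restored.shape[0]):
        for j in range(restored.shape[1]):
            for k in range(restored.shape[2]):
                restored[i, j, k] = int(f.readline())

assert np.array_equal(data, restored)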


回答 6

我有一种方法可以使用简单的filename.write()操作。它对我来说很好用,但是我正在处理具有约1500个数据元素的数组。

我基本上只需要for循环来遍历文件,然后以csv样式输出将其逐行写入输出目标。

import numpy as np

trial = np.genfromtxt("/extension/file.txt", dtype=str, delimiter=",")
num_of_columns = trial.shape[1]

with open("/extension/file.txt", "w") as f:
    for x in xrange(len(trial[:,1])):
        for y in range(num_of_columns):
            if y < num_of_columns - 1:
                f.write(trial[x][y] + ",")
            else:
                f.write(trial[x][y])
        f.write("\n")

if/else 语句用于在数据元素之间添加逗号(最后一列之后不加逗号)。不知道为什么,当把文件读成 ndarray 时这些逗号会被去掉。我的目标是把文件输出为 csv,这个方法有助于做到这一点。

希望这可以帮助!

I have a way to do it using a simple filename.write() operation. It works fine for me, but I’m dealing with arrays having ~1500 data elements.

I basically just have for loops to iterate through the file and write it to the output destination line-by-line in a csv style output.

import numpy as np

trial = np.genfromtxt("/extension/file.txt", dtype=str, delimiter=",")
num_of_columns = trial.shape[1]

with open("/extension/file.txt", "w") as f:
    for x in xrange(len(trial[:,1])):
        for y in range(num_of_columns):
            if y < num_of_columns - 1:
                f.write(trial[x][y] + ",")
            else:
                f.write(trial[x][y])
        f.write("\n")

The if/else statement is used to add commas between the data elements, with no trailing comma after the last column. For whatever reason, these get stripped out when reading the file in as an nd array. My goal was to output the file as a csv, so this method helps to handle that.

Hope this helps!


回答 7

pickle 最适合这类情况。假设您有一个名为 x_train 的 ndarray。您可以将其转储到文件中,然后用以下命令将其还原:

import pickle

### Dump to file
with open("myfile.pkl","wb") as f:
    pickle.dump(x_train,f)

###Extract from file
with open("myfile.pkl","rb") as f:
    x_temp = pickle.load(f)

Pickle is best for these cases. Suppose you have a ndarray named x_train. You can dump it into a file and revert it back using the following command:

import pickle

### Dump to file
with open("myfile.pkl","wb") as f:
    pickle.dump(x_train,f)

###Extract from file
with open("myfile.pkl","rb") as f:
    x_temp = pickle.load(f)

如何修复imdb.load_data()函数的“allow_pickle=False时无法加载对象数组”错误?

问题:如何修复imdb.load_data()函数的“allow_pickle=False时无法加载对象数组”错误?

我正在尝试使用Google Colab中的IMDb数据集实现二进制分类示例。我以前已经实现了此模型。但是,几天后我再次尝试执行此操作时,它返回一个值错误:对于load_data()函数,当allow_pickle = False时无法加载对象数组。

我已经参考一个类似问题的现有答案尝试解决此问题:如何修复sketch_rnn算法中的“当allow_pickle=False时无法加载对象数组”。但事实证明,仅仅添加 allow_pickle 参数是不够的。

我的代码:

from keras.datasets import imdb
(train_data, train_labels), (test_data, test_labels) = imdb.load_data(num_words=10000)

错误:

ValueError                                Traceback (most recent call last)
<ipython-input-1-2ab3902db485> in <module>()
      1 from keras.datasets import imdb
----> 2 (train_data, train_labels), (test_data, test_labels) = imdb.load_data(num_words=10000)

2 frames
/usr/local/lib/python3.6/dist-packages/keras/datasets/imdb.py in load_data(path, num_words, skip_top, maxlen, seed, start_char, oov_char, index_from, **kwargs)
     57                     file_hash='599dadb1135973df5b59232a0e9a887c')
     58     with np.load(path) as f:
---> 59         x_train, labels_train = f['x_train'], f['y_train']
     60         x_test, labels_test = f['x_test'], f['y_test']
     61 

/usr/local/lib/python3.6/dist-packages/numpy/lib/npyio.py in __getitem__(self, key)
    260                 return format.read_array(bytes,
    261                                          allow_pickle=self.allow_pickle,
--> 262                                          pickle_kwargs=self.pickle_kwargs)
    263             else:
    264                 return self.zip.read(key)

/usr/local/lib/python3.6/dist-packages/numpy/lib/format.py in read_array(fp, allow_pickle, pickle_kwargs)
    690         # The array contained Python objects. We need to unpickle the data.
    691         if not allow_pickle:
--> 692             raise ValueError("Object arrays cannot be loaded when "
    693                              "allow_pickle=False")
    694         if pickle_kwargs is None:

ValueError: Object arrays cannot be loaded when allow_pickle=False

I’m trying to implement the binary classification example using the IMDb dataset in Google Colab. I have implemented this model before. But when I tried to do it again after a few days, it returned a value error: 'Object arrays cannot be loaded when allow_pickle=False' for the load_data() function.

I have already tried solving this, referring to an existing answer for a similar problem: How to fix ‘Object arrays cannot be loaded when allow_pickle=False’ in the sketch_rnn algorithm. But it turns out that just adding an allow_pickle argument isn’t sufficient.

My code:

from keras.datasets import imdb
(train_data, train_labels), (test_data, test_labels) = imdb.load_data(num_words=10000)

The error:

ValueError                                Traceback (most recent call last)
<ipython-input-1-2ab3902db485> in <module>()
      1 from keras.datasets import imdb
----> 2 (train_data, train_labels), (test_data, test_labels) = imdb.load_data(num_words=10000)

2 frames
/usr/local/lib/python3.6/dist-packages/keras/datasets/imdb.py in load_data(path, num_words, skip_top, maxlen, seed, start_char, oov_char, index_from, **kwargs)
     57                     file_hash='599dadb1135973df5b59232a0e9a887c')
     58     with np.load(path) as f:
---> 59         x_train, labels_train = f['x_train'], f['y_train']
     60         x_test, labels_test = f['x_test'], f['y_test']
     61 

/usr/local/lib/python3.6/dist-packages/numpy/lib/npyio.py in __getitem__(self, key)
    260                 return format.read_array(bytes,
    261                                          allow_pickle=self.allow_pickle,
--> 262                                          pickle_kwargs=self.pickle_kwargs)
    263             else:
    264                 return self.zip.read(key)

/usr/local/lib/python3.6/dist-packages/numpy/lib/format.py in read_array(fp, allow_pickle, pickle_kwargs)
    690         # The array contained Python objects. We need to unpickle the data.
    691         if not allow_pickle:
--> 692             raise ValueError("Object arrays cannot be loaded when "
    693                              "allow_pickle=False")
    694         if pickle_kwargs is None:

ValueError: Object arrays cannot be loaded when allow_pickle=False

回答 0

这里有一个技巧,可以强制 imdb.load_data 允许 pickle:在您的笔记本中,将这一行:

(train_data, train_labels), (test_data, test_labels) = imdb.load_data(num_words=10000)

替换为:

import numpy as np
# save np.load
np_load_old = np.load

# modify the default parameters of np.load
np.load = lambda *a,**k: np_load_old(*a, allow_pickle=True, **k)

# call load_data with allow_pickle implicitly set to true
(train_data, train_labels), (test_data, test_labels) = imdb.load_data(num_words=10000)

# restore np.load for future normal usage
np.load = np_load_old

Here’s a trick to force imdb.load_data to allow pickle by, in your notebook, replacing this line:

(train_data, train_labels), (test_data, test_labels) = imdb.load_data(num_words=10000)

by this:

import numpy as np
# save np.load
np_load_old = np.load

# modify the default parameters of np.load
np.load = lambda *a,**k: np_load_old(*a, allow_pickle=True, **k)

# call load_data with allow_pickle implicitly set to true
(train_data, train_labels), (test_data, test_labels) = imdb.load_data(num_words=10000)

# restore np.load for future normal usage
np.load = np_load_old
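
If you need this trick in more than one place, a small context manager keeps the patch scoped, so np.load is guaranteed to be restored even if load_data raises; a sketch of the same idea:

import contextlib

import numpy as np

@contextlib.contextmanager
def np_load_allow_pickle():
    np_load_old = np.load
    np.load = lambda *a, **k: np_load_old(*a, allow_pickle=True, **k)
    try:
        yield
    finally:
        np.load = np_load_old   # restored even on error

with np_load_allow_pickle():
    from keras.datasets import imdb
    (train_data, train_labels), (test_data, test_labels) = imdb.load_data(num_words=10000)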

回答 1

这个问题在 keras 的 GitHub 上仍未关闭,我希望它能尽快解决。在此之前,请尝试将 numpy 降级到 1.16.1 版,这似乎可以解决问题。

!pip install numpy==1.16.1
import numpy as np

此版本的 numpy 中,allow_pickle 的默认值为 True。

This issue is still open on the keras git. I hope it gets solved as soon as possible. Until then, try downgrading your numpy version to 1.16.1. It seems to solve the problem.

!pip install numpy==1.16.1
import numpy as np

This version of numpy has the default value of allow_pickle as True.


回答 2

根据 GitHub 上的这个 issue,官方解决方案是编辑 imdb.py 文件。此修复对我来说效果很好,无需降级 numpy。在以下位置找到 imdb.py 文件:tensorflow/python/keras/datasets/imdb.py(对我来说完整路径是:C:\Anaconda\Lib\site-packages\tensorflow\python\keras\datasets\imdb.py——其他安装会有所不同),并按照下面的差异更改第 85 行:

-  with np.load(path) as f:
+  with np.load(path, allow_pickle=True) as f:

进行此更改的原因是安全性:防止 pickle 文件中出现相当于 SQL 注入的 Python 代码。上面的更改只会影响 imdb 数据,因此您在其他地方仍保留了这层安全性(也无需降级 numpy)。

Following this issue on GitHub, the official solution is to edit the imdb.py file. This fix worked well for me without the need to downgrade numpy. Find the imdb.py file at tensorflow/python/keras/datasets/imdb.py (full path for me was: C:\Anaconda\Lib\site-packages\tensorflow\python\keras\datasets\imdb.py – other installs will be different) and change line 85 as per the diff:

-  with np.load(path) as f:
+  with np.load(path, allow_pickle=True) as f:

The reason for the change is security, to prevent the Python equivalent of an SQL injection in a pickled file. The change above will ONLY affect the imdb data and you therefore retain the security elsewhere (by not downgrading numpy).


回答 3

我只是使用allow_pickle = True作为np.load()的参数,它对我有用。

I just used allow_pickle = True as an argument to np.load() and it worked for me.


回答 4

就我而言,下面的写法有效:

np.load(path, allow_pickle=True)

In my case, it worked with:

np.load(path, allow_pickle=True)

回答 5

我认为 cheez(https://stackoverflow.com/users/122933/cheez)的答案是最简单、最有效的。我会在其基础上稍作展开,使其不会在整个会话期间一直改写 numpy 的函数。

我的建议如下。我用它从 keras 下载 reuters 数据集,该数据集报的是同一类错误:

old = np.load
np.load = lambda *a,**k: old(*a,**k,allow_pickle=True)

from keras.datasets import reuters
(train_data, train_labels), (test_data, test_labels) = reuters.load_data(num_words=10000)

np.load = old
del(old)

I think the answer from cheez (https://stackoverflow.com/users/122933/cheez) is the easiest and most effective one. I’d elaborate a little bit over it so it would not modify a numpy function for the whole session period.

My suggestion is below. I´m using it to download the reuters dataset from keras which is showing the same kind of error:

old = np.load
np.load = lambda *a,**k: old(*a,**k,allow_pickle=True)

from keras.datasets import reuters
(train_data, train_labels), (test_data, test_labels) = reuters.load_data(num_words=10000)

np.load = old
del(old)

回答 6

您可以尝试更改标志的值

np.load(training_image_names_array,allow_pickle=True)

You can try changing the flag’s value

np.load(training_image_names_array,allow_pickle=True)

回答 7

上面列出的解决方案对我都不起作用:我在 anaconda 中运行 python 3.7.3。对我有用的是

  • 从 Anaconda powershell 运行 “conda install numpy==1.16.1”

  • 关闭并重新打开笔记本

none of the above listed solutions worked for me: i run anaconda with python 3.7.3. What worked for me was

  • run “conda install numpy==1.16.1” from Anaconda powershell

  • close and reopen the notebook


回答 8

在jupyter笔记本上使用

np_load_old = np.load

# modify the default parameters of np.load
np.load = lambda *a,**k: np_load_old(*a, allow_pickle=True, **k)

工作正常,但在 spyder 中使用此方法时就会出现问题(您每次都必须重新启动内核),否则会收到类似以下的错误:

TypeError: <lambda>() got multiple values for keyword argument 'allow_pickle'

我用这里的解决方案解决了这个问题:

On a jupyter notebook, using

np_load_old = np.load

# modify the default parameters of np.load
np.load = lambda *a,**k: np_load_old(*a, allow_pickle=True, **k)

worked fine, but the problem appears when you use this method in spyder (you have to restart the kernel every time), or you will get an error like:

TypeError: <lambda>() got multiple values for keyword argument 'allow_pickle'

I solved this issue using the solution here:


回答 9

我也碰到了这个问题,尝试了上面的方法,但没能解决。

我实际上是在一份现成的代码上工作,其中用到了

pickle.load(path)

所以我将其替换为

np.load(path, allow_pickle=True)

I landed up here, tried your ways, and could not figure it out.

I was actually working on pre-given code where

pickle.load(path)

was used, so I replaced it with

np.load(path, allow_pickle=True)

回答 10

是的,安装以前的numpy版本可以解决此问题。

对于使用PyCharm IDE的用户:

在我的IDE(Pycharm)中,依次单击File-> Settings-> Project Interpreter:我发现numpy为1.16.3,所以我恢复为1.16.1。单击+并在搜索中键入numpy,在“指定版本”上打勾:1.16.1并选择->安装软件包。

Yes, installing previous a version of numpy solved the problem.

For those who uses PyCharm IDE:

in my IDE (Pycharm), File->Settings->Project Interpreter: I found my numpy to be 1.16.3, so I revert back to 1.16.1. Click + and type numpy in the search, tick “specify version” : 1.16.1 and choose–> install package.


回答 11

找到imdb.py的路径,然后将标志添加到np.load(path,… flag …)

    def load_data(.......):
    .......................................
    .......................................
    - with np.load(path) as f:
    + with np.load(path,allow_pickle=True) as f:

find the path to imdb.py then just add the flag to np.load(path,…flag…)

    def load_data(.......):
    .......................................
    .......................................
    - with np.load(path) as f:
    + with np.load(path,allow_pickle=True) as f:

回答 12

这对我有效:

        import numpy as np
        from keras.datasets import reuters

        np_load_old = np.load
        np.load = lambda *a: np_load_old(*a, allow_pickle=True)
        (x_train, y_train), (x_test, y_test) = reuters.load_data(num_words=None, test_split=0.2)
        np.load = np_load_old

It works for me:

        import numpy as np
        from keras.datasets import reuters

        np_load_old = np.load
        np.load = lambda *a: np_load_old(*a, allow_pickle=True)
        (x_train, y_train), (x_test, y_test) = reuters.load_data(num_words=None, test_split=0.2)
        np.load = np_load_old

回答 13

我发现TensorFlow 2.0(我正在使用2.0.0-alpha0)与最新版本的Numpy不兼容,即v1.17.0(可能还有v1.16.5 +)。导入TF2后,它会抛出巨大的FutureWarning列表,如下所示:

FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint8 = np.dtype([("qint8", np.int8, 1)])
/anaconda3/lib/python3.6/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:541: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint8 = np.dtype([("qint8", np.int8, 1)])
/anaconda3/lib/python3.6/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:542: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_quint8 = np.dtype([("quint8", np.uint8, 1)])
/anaconda3/lib/python3.6/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:543: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.

尝试从keras加载imdb数据集时,这还会导致allow_pickle错误

我尝试使用下面这个很有效的解决方案,但每个导入 TF2 或 tf.keras 的项目里我都得这么做一遍。

np_load_old = np.load
np.load = lambda *a,**k: np_load_old(*a, allow_pickle=True, **k)

我找到的最简单的解决方案是全局安装numpy 1.16.1,或者在虚拟环境中使用tensorflow和numpy的兼容版本。

我写这个答案是想指出:这不仅仅是 imdb.load_data 的问题,而是 TF2 与 Numpy 版本不兼容所引起的更大问题,它还可能导致许多其他隐藏的错误或问题。

What I have found is that TensorFlow 2.0 (I am using 2.0.0-alpha0) is not compatible with the latest version of Numpy i.e. v1.17.0 (and possibly v1.16.5+). As soon as TF2 is imported, it throws a huge list of FutureWarning, that looks something like this:

FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint8 = np.dtype([("qint8", np.int8, 1)])
/anaconda3/lib/python3.6/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:541: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint8 = np.dtype([("qint8", np.int8, 1)])
/anaconda3/lib/python3.6/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:542: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_quint8 = np.dtype([("quint8", np.uint8, 1)])
/anaconda3/lib/python3.6/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:543: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.

This also resulted in the allow_pickle error when tried to load imdb dataset from keras

I tried to use the following solution which worked just fine, but I had to do it every single project where I was importing TF2 or tf.keras.

np_load_old = np.load
np.load = lambda *a,**k: np_load_old(*a, allow_pickle=True, **k)

The easiest solution I found was to either install numpy 1.16.1 globally, or use compatible versions of tensorflow and numpy in a virtual environment.

My goal with this answer is to point out that it’s not just a problem with imdb.load_data, but a larger problem caused by incompatibility of TF2 and Numpy versions that may result in many other hidden bugs or issues.


回答 14

Tensorflow在tf-nightly版本中有一个修复。

!pip install tf-nightly

当前版本是’2.0.0-dev20190511’。

Tensorflow has a fix in tf-nightly version.

!pip install tf-nightly

The current version is ‘2.0.0-dev20190511’.


回答 15

@cheez 的答案有时不起作用,函数会一次又一次地递归调用自身。要解决此问题,您应该对该函数做一份独立的拷贝。您可以使用 partial 函数来做到这一点,因此最终代码为:

import numpy as np
from functools import partial

# save np.load
np_load_old = partial(np.load)

# modify the default parameters of np.load
np.load = lambda *a,**k: np_load_old(*a, allow_pickle=True, **k)

# call load_data with allow_pickle implicitly set to true
(train_data, train_labels), (test_data, test_labels) = \
    imdb.load_data(num_words=10000)

# restore np.load for future normal usage
np.load = np_load_old

The answer of @cheez sometimes doesn’t work and recursively calls the function again and again. To solve this problem you should make an independent copy of the function. You can do this by using functools.partial, so the final code is:

import numpy as np
from functools import partial

# save np.load
np_load_old = partial(np.load)

# modify the default parameters of np.load
np.load = lambda *a,**k: np_load_old(*a, allow_pickle=True, **k)

# call load_data with allow_pickle implicitly set to true
(train_data, train_labels), (test_data, test_labels) = \
    imdb.load_data(num_words=10000)

# restore np.load for future normal usage
np.load = np_load_old

回答 16

我通常不发这类帖子,但这个问题实在太烦人了。造成混淆的原因是:某些 Keras 的 imdb.py 文件已经把

with np.load(path) as f:

更新成了带 allow_pickle=True 的版本。请检查您的 imdb.py 文件,确认是否已经做了此更改。如果已经调整过,以下代码就可以正常工作:

from keras.datasets import imdb
(train_text, train_labels), (test_text, test_labels) = imdb.load_data(num_words=10000)

I don’t usually post to these things but this was super annoying. The confusion comes from the fact that some of the Keras imdb.py files have already been updated from:

with np.load(path) as f:

to the version with allow_pickle=True. Make sure to check the imdb.py file to see if this change was already implemented. If it has been adjusted, the following works fine:

from keras.datasets import imdb
(train_text, train_labels), (test_text, test_labels) = imdb.load_data(num_words=10000)

回答 17

最简单的方法是在 imdb.py 中抛出错误的那一行,把 np.load 的设置改为 allow_pickle=True。

The easiest way is to change imdb.py setting allow_pickle=True to np.load at the line where imdb.py throws error.


如何将RGB图像转换为numpy数组?

问题:如何将RGB图像转换为numpy数组?

我有RGB图像。我想将其转换为numpy数组。我做了以下

im = cv.LoadImage("abc.tiff")
a = numpy.asarray(im)

它创建一个没有形状的数组。我假设它是一个iplimage对象。

I have an RGB image. I want to convert it to numpy array. I did the following

im = cv.LoadImage("abc.tiff")
a = numpy.asarray(im)

It creates an array with no shape. I assume it is a iplimage object.


回答 0

您可以使用较新的OpenCV python接口(如果我没记错的话,自OpenCV 2.2起就可以使用)。它本机使用numpy数组:

import cv2
im = cv2.imread("abc.tiff")  # loads the image as a numpy array (BGR channel order)
print type(im)

结果:

<type 'numpy.ndarray'>

You can use newer OpenCV python interface (if I’m not mistaken it is available since OpenCV 2.2). It natively uses numpy arrays:

import cv2
im = cv2.imread("abc.tiff")  # loads the image as a numpy array (BGR channel order)
print type(im)

result:

<type 'numpy.ndarray'>

回答 1

PIL(Python影像库)和Numpy可以很好地协同工作。

我使用以下函数。

from PIL import Image
import numpy as np

def load_image( infilename ) :
    img = Image.open( infilename )
    img.load()
    data = np.asarray( img, dtype="int32" )
    return data

def save_image( npdata, outfilename ) :
    img = Image.fromarray( np.asarray( np.clip(npdata,0,255), dtype="uint8"), "L" )
    img.save( outfilename )

“Image.fromarray” 这里写得有点难看,因为我把传入的数据裁剪到 [0,255]、转换为字节,然后创建灰度图像。我大部分时间处理的都是灰度图。

RGB图像如下所示:

 outimg = Image.fromarray( ycc_uint8, "RGB" )
 outimg.save( "ycc.tif" )

PIL (Python Imaging Library) and Numpy work well together.

I use the following functions.

from PIL import Image
import numpy as np

def load_image( infilename ) :
    img = Image.open( infilename )
    img.load()
    data = np.asarray( img, dtype="int32" )
    return data

def save_image( npdata, outfilename ) :
    img = Image.fromarray( np.asarray( np.clip(npdata,0,255), dtype="uint8"), "L" )
    img.save( outfilename )

The ‘Image.fromarray’ is a little ugly because I clip incoming data to [0,255], convert to bytes, then create a grayscale image. I mostly work in gray.

An RGB image would be something like:

 outimg = Image.fromarray( ycc_uint8, "RGB" )
 outimg.save( "ycc.tif" )
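
A quick usage sketch of the two helpers above (the filenames are hypothetical, and the input is assumed to be a grayscale image so that the "L"-mode save applies):

data = load_image("gray.tiff")           # 2-D int32 array
save_image(255 - data, "inverted.tiff")  # invert the values and write back out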

回答 2

您也可以为此使用matplotlib

from matplotlib.image import imread

img = imread('abc.tiff')
print(type(img))

输出: <class 'numpy.ndarray'>

You can also use matplotlib for this.

from matplotlib.image import imread

img = imread('abc.tiff')
print(type(img))

output: <class 'numpy.ndarray'>


回答 3

截至今天,您最好的选择是使用:

img = cv2.imread(image_path)   # reads an image in the BGR format
img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)   # BGR -> RGB

您会看到 img 是一个 numpy 数组,类型为:

<class 'numpy.ndarray'>

As of today, your best bet is to use:

img = cv2.imread(image_path)   # reads an image in the BGR format
img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)   # BGR -> RGB

You’ll see img will be a numpy array of type:

<class 'numpy.ndarray'>

回答 4

回答得比较晚,但与其他替代方案相比,我更喜欢 imageio 模块

import imageio
im = imageio.imread('abc.tiff')

与 cv2.imread() 类似,它默认生成 numpy 数组,但通道顺序是 RGB。

Late answer, but I’ve come to prefer the imageio module to the other alternatives

import imageio
im = imageio.imread('abc.tiff')

Similar to cv2.imread(), it produces a numpy array by default, but in RGB form.


回答 5

您需要使用cv.LoadImageM而不是cv.LoadImage:

In [1]: import cv
In [2]: import numpy as np
In [3]: x = cv.LoadImageM('im.tif')
In [4]: im = np.asarray(x)
In [5]: im.shape
Out[5]: (487, 650, 3)

You need to use cv.LoadImageM instead of cv.LoadImage:

In [1]: import cv
In [2]: import numpy as np
In [3]: x = cv.LoadImageM('im.tif')
In [4]: im = np.asarray(x)
In [5]: im.shape
Out[5]: (487, 650, 3)

回答 6

当使用 David Poole 的答案时,灰度 PNG(也许还有其他文件)会出现 SystemError。我的解决方案是:

import numpy as np
from PIL import Image

img = Image.open( filename )
try:
    data = np.asarray( img, dtype='uint8' )
except SystemError:
    data = np.asarray( img.getdata(), dtype='uint8' )

实际上img.getdata()适用于所有文件,但速度较慢,因此仅在其他方法失败时才使用它。

When using the answer from David Poole I get a SystemError with gray scale PNGs and maybe other files. My solution is:

import numpy as np
from PIL import Image

img = Image.open( filename )
try:
    data = np.asarray( img, dtype='uint8' )
except SystemError:
    data = np.asarray( img.getdata(), dtype='uint8' )

Actually img.getdata() would work for all files, but it’s slower, so I use it only when the other method fails.


回答 7

OpenCV 图像格式支持 numpy 数组接口。可以编写一个辅助函数,同时支持灰度或彩色图像。这意味着 BGR -> RGB 转换可以方便地用一个 numpy 切片完成,而不必对图像数据做完整拷贝。

注意:这是一个步幅(stride)技巧,因此修改输出数组也会更改 OpenCV 的图像数据。如果需要副本,请对数组使用 .copy() 方法!

import numpy as np

def img_as_array(im):
    """OpenCV's native format to a numpy array view"""
    w, h, n = im.width, im.height, im.channels
    modes = {1: "L", 3: "RGB", 4: "RGBA"}
    if n not in modes:
        raise Exception('unsupported number of channels: {0}'.format(n))
    out = np.asarray(im)
    if n != 1:
        out = out[:, :, ::-1]  # BGR -> RGB conversion
    return out

OpenCV image format supports the numpy array interface. A helper function can be made to support either grayscale or color images. This means the BGR -> RGB conversion can be conveniently done with a numpy slice, not a full copy of image data.

Note: this is a stride trick, so modifying the output array will also change the OpenCV image data. If you want a copy, use .copy() method on the array!

import numpy as np

def img_as_array(im):
    """OpenCV's native format to a numpy array view"""
    w, h, n = im.width, im.height, im.channels
    modes = {1: "L", 3: "RGB", 4: "RGBA"}
    if n not in modes:
        raise Exception('unsupported number of channels: {0}'.format(n))
    out = np.asarray(im)
    if n != 1:
        out = out[:, :, ::-1]  # BGR -> RGB conversion
    return out

回答 8

我也采用了 imageio,但我发现下面这套辅助代码对预处理和后处理很有用:

import imageio
import numpy as np

def imload(*a, **k):
    i = imageio.imread(*a, **k)
    i = i.transpose((1, 0, 2))  # x and y are mixed up for some reason...
    i = np.flip(i, 1)  # make coordinate system right-handed!!!!!!
    return i/255


def imsave(i, url, *a, **k):
    # Original order of arguments was counterintuitive. It should
    # read verbally "Save the image to the URL" — not "Save to the
    # URL the image."

    i = np.flip(i, 1)
    i = i.transpose((1, 0, 2))
    i *= 255

    i = i.round()
    i = np.maximum(i, 0)
    i = np.minimum(i, 255)

    i = np.asarray(i, dtype=np.uint8)

    imageio.imwrite(url, i, *a, **k)

原因是我使用 numpy 做图像处理,而不仅仅是图像显示。为此,uint8 很不方便,所以我将其转换为取值从 0 到 1 的浮点值。

保存图像时,我注意到必须自己裁掉超出范围的值,否则最终会得到一片灰蒙蒙的输出。(灰色输出是 imageio 把超出 [0, 256) 的完整取值范围压缩到范围之内造成的。)

我在注释中也提到了其他一些奇怪之处。

I also adopted imageio, but I found the following machinery useful for pre- and post-processing:

import imageio
import numpy as np

def imload(*a, **k):
    i = imageio.imread(*a, **k)
    i = i.transpose((1, 0, 2))  # x and y are mixed up for some reason...
    i = np.flip(i, 1)  # make coordinate system right-handed!!!!!!
    return i/255


def imsave(i, url, *a, **k):
    # Original order of arguments was counterintuitive. It should
    # read verbally "Save the image to the URL" — not "Save to the
    # URL the image."

    i = np.flip(i, 1)
    i = i.transpose((1, 0, 2))
    i *= 255

    i = i.round()
    i = np.maximum(i, 0)
    i = np.minimum(i, 255)

    i = np.asarray(i, dtype=np.uint8)

    imageio.imwrite(url, i, *a, **k)

The rationale is that I am using numpy for image processing, not just image displaying. For this purpose, uint8s are awkward, so I convert to floating point values ranging from 0 to 1.

When saving images, I noticed I had to cut the out-of-range values myself, or else I ended up with a really gray output. (The gray output was the result of imageio compressing the full range, which was outside of [0, 256), to values that were inside the range.)

There were a couple other oddities, too, which I mentioned in the comments.


回答 9

您可以使用 numpy 和 PIL 的 Image 轻松得到 RGB 图片的 numpy 数组

import numpy as np
from PIL import Image
import matplotlib.pyplot as plt

im = Image.open('*image_name*') #These two lines
im_arr = np.array(im) #are all you need
plt.imshow(im_arr) #Just to verify that image array has been constructed properly

You can get numpy array of rgb image easily by using numpy and Image from PIL

import numpy as np
from PIL import Image
import matplotlib.pyplot as plt

im = Image.open('*image_name*') #These two lines
im_arr = np.array(im) #are all you need
plt.imshow(im_arr) #Just to verify that image array has been constructed properly

回答 10

使用以下语法加载图像:

from keras.preprocessing import image

X_test = image.load_img('four.png', target_size=(28,28), color_mode="grayscale")  # load the image, convert it to grayscale, and resize to the target size
X_test = image.img_to_array(X_test)  # convert the image into an array

Load the image by using the following syntax:

from keras.preprocessing import image

X_test = image.load_img('four.png', target_size=(28,28), color_mode="grayscale")  # load the image, convert it to grayscale, and resize to the target size
X_test = image.img_to_array(X_test)  # convert the image into an array

Python中的主成分分析

问题:Python中的主成分分析

我想使用主成分分析(PCA)进行降维。numpy 或 scipy 是否已经提供了它,还是我必须用 numpy.linalg.eigh 自己实现?

我不想只用奇异值分解(SVD),因为我的输入数据维度相当高(约 460 维),所以我认为 SVD 会比计算协方差矩阵的特征向量更慢。

我希望找到一个现成的、已调试过的实现,它已经就何时使用哪种方法做出了正确的决策,或许还做了一些我不知道的其他优化。

I’d like to use principal component analysis (PCA) for dimensionality reduction. Does numpy or scipy already have it, or do I have to roll my own using numpy.linalg.eigh?

I don’t just want to use singular value decomposition (SVD) because my input data are quite high-dimensional (~460 dimensions), so I think SVD will be slower than computing the eigenvectors of the covariance matrix.

I was hoping to find a premade, debugged implementation that already makes the right decisions for when to use which method, and which maybe does other optimizations that I don’t know about.


回答 0

您可以看看MDP

我没有机会亲自测试它,但我正是为了其中的 PCA 功能才把它加入书签的。

You might have a look at MDP.

I have not had the chance to test it myself, but I’ve bookmarked it exactly for the PCA functionality.


回答 1

几个月后:这是一个小小的 PCA 类,外加一张图:

#!/usr/bin/env python
""" a small class for Principal Component Analysis
Usage:
    p = PCA( A, fraction=0.90 )
In:
    A: an array of e.g. 1000 observations x 20 variables, 1000 rows x 20 columns
    fraction: use principal components that account for e.g.
        90 % of the total variance

Out:
    p.U, p.d, p.Vt: from numpy.linalg.svd, A = U . d . Vt
    p.dinv: 1/d or 0, see NR
    p.eigen: the eigenvalues of A*A, in decreasing order (p.d**2).
        eigen[j] / eigen.sum() is variable j's fraction of the total variance;
        look at the first few eigen[] to see how many PCs get to 90 %, 95 % ...
    p.npc: number of principal components,
        e.g. 2 if the top 2 eigenvalues are >= `fraction` of the total.
        It's ok to change this; methods use the current value.

Methods:
    The methods of class PCA transform vectors or arrays of e.g.
    20 variables, 2 principal components and 1000 observations,
    using partial matrices U' d' Vt', parts of the full U d Vt:
    A ~ U' . d' . Vt' where e.g.
        U' is 1000 x 2
        d' is diag([ d0, d1 ]), the 2 largest singular values
        Vt' is 2 x 20.  Dropping the primes,

    d . Vt      2 principal vars = p.vars_pc( 20 vars )
    U           1000 obs = p.pc_obs( 2 principal vars )
    U . d . Vt  1000 obs, p.obs( 20 vars ) = pc_obs( vars_pc( vars ))
        fast approximate A . vars, using the `npc` principal components

    Ut              2 pcs = p.obs_pc( 1000 obs )
    V . dinv        20 vars = p.pc_vars( 2 principal vars )
    V . dinv . Ut   20 vars, p.vars( 1000 obs ) = pc_vars( obs_pc( obs )),
        fast approximate Ainverse . obs: vars that give ~ those obs.


Notes:
    PCA does not center or scale A; you usually want to first
        A -= A.mean(A, axis=0)
        A /= A.std(A, axis=0)
    with the little class Center or the like, below.

See also:
    http://en.wikipedia.org/wiki/Principal_component_analysis
    http://en.wikipedia.org/wiki/Singular_value_decomposition
    Press et al., Numerical Recipes (2 or 3 ed), SVD
    PCA micro-tutorial
    iris-pca .py .png

"""

from __future__ import division
import numpy as np
dot = np.dot
    # import bz.numpyutil as nu
    # dot = nu.pdot

__version__ = "2010-04-14 apr"
__author_email__ = "denis-bz-py at t-online dot de"

#...............................................................................
class PCA:
    def __init__( self, A, fraction=0.90 ):
        assert 0 <= fraction <= 1
            # A = U . diag(d) . Vt, O( m n^2 ), lapack_lite --
        self.U, self.d, self.Vt = np.linalg.svd( A, full_matrices=False )
        assert np.all( self.d[:-1] >= self.d[1:] )  # sorted
        self.eigen = self.d**2
        self.sumvariance = np.cumsum(self.eigen)
        self.sumvariance /= self.sumvariance[-1]
        self.npc = np.searchsorted( self.sumvariance, fraction ) + 1
        self.dinv = np.array([ 1/d if d > self.d[0] * 1e-6  else 0
                                for d in self.d ])

    def pc( self ):
        """ e.g. 1000 x 2 U[:, :npc] * d[:npc], to plot etc. """
        n = self.npc
        return self.U[:, :n] * self.d[:n]

    # These 1-line methods may not be worth the bother;
    # then use U d Vt directly --

    def vars_pc( self, x ):
        n = self.npc
        return self.d[:n] * dot( self.Vt[:n], x.T ).T  # 20 vars -> 2 principal

    def pc_vars( self, p ):
        n = self.npc
        return dot( self.Vt[:n].T, (self.dinv[:n] * p).T ) .T  # 2 PC -> 20 vars

    def pc_obs( self, p ):
        n = self.npc
        return dot( self.U[:, :n], p.T )  # 2 principal -> 1000 obs

    def obs_pc( self, obs ):
        n = self.npc
        return dot( self.U[:, :n].T, obs ) .T  # 1000 obs -> 2 principal

    def obs( self, x ):
        return self.pc_obs( self.vars_pc(x) )  # 20 vars -> 2 principal -> 1000 obs

    def vars( self, obs ):
        return self.pc_vars( self.obs_pc(obs) )  # 1000 obs -> 2 principal -> 20 vars


class Center:
    """ A -= A.mean() /= A.std(), inplace -- use A.copy() if need be
        uncenter(x) == original A . x
    """
        # mttiw
    def __init__( self, A, axis=0, scale=True, verbose=1 ):
        self.mean = A.mean(axis=axis)
        if verbose:
            print "Center -= A.mean:", self.mean
        A -= self.mean
        if scale:
            std = A.std(axis=axis)
            self.std = np.where( std, std, 1. )
            if verbose:
                print "Center /= A.std:", self.std
            A /= self.std
        else:
            self.std = np.ones( A.shape[-1] )
        self.A = A

    def uncenter( self, x ):
        return np.dot( self.A, x * self.std ) + np.dot( x, self.mean )


#...............................................................................
if __name__ == "__main__":
    import sys

    csv = "iris4.csv"  # wikipedia Iris_flower_data_set
        # 5.1,3.5,1.4,0.2  # ,Iris-setosa ...
    N = 1000
    K = 20
    fraction = .90
    seed = 1
    exec "\n".join( sys.argv[1:] )  # N= ...
    np.random.seed(seed)
    np.set_printoptions( 1, threshold=100, suppress=True )  # .1f
    try:
        A = np.genfromtxt( csv, delimiter="," )
        N, K = A.shape
    except IOError:
        A = np.random.normal( size=(N, K) )  # gen correlated ?

    print "csv: %s  N: %d  K: %d  fraction: %.2g" % (csv, N, K, fraction)
    Center(A)
    print "A:", A

    print "PCA ..." ,
    p = PCA( A, fraction=fraction )
    print "npc:", p.npc
    print "% variance:", p.sumvariance * 100

    print "Vt[0], weights that give PC 0:", p.Vt[0]
    print "A . Vt[0]:", dot( A, p.Vt[0] )
    print "pc:", p.pc()

    print "\nobs <-> pc <-> x: with fraction=1, diffs should be ~ 0"
    x = np.ones(K)
    # x = np.ones(( 3, K ))
    print "x:", x
    pc = p.vars_pc(x)  # d' Vt' x
    print "vars_pc(x):", pc
    print "back to ~ x:", p.pc_vars(pc)

    Ax = dot( A, x.T )
    pcx = p.obs(x)  # U' d' Vt' x
    print "Ax:", Ax
    print "A'x:", pcx
    print "max |Ax - A'x|: %.2g" % np.linalg.norm( Ax - pcx, np.inf )

    b = Ax  # ~ back to original x, Ainv A x
    back = p.vars(b)
    print "~ back again:", back
    print "max |back - x|: %.2g" % np.linalg.norm( back - x, np.inf )

# end pca.py

[图:iris-pca 输出图]

Months later, here’s a small class PCA, and a picture:

#!/usr/bin/env python
""" a small class for Principal Component Analysis
Usage:
    p = PCA( A, fraction=0.90 )
In:
    A: an array of e.g. 1000 observations x 20 variables, 1000 rows x 20 columns
    fraction: use principal components that account for e.g.
        90 % of the total variance

Out:
    p.U, p.d, p.Vt: from numpy.linalg.svd, A = U . d . Vt
    p.dinv: 1/d or 0, see NR
    p.eigen: the eigenvalues of A*A, in decreasing order (p.d**2).
        eigen[j] / eigen.sum() is variable j's fraction of the total variance;
        look at the first few eigen[] to see how many PCs get to 90 %, 95 % ...
    p.npc: number of principal components,
        e.g. 2 if the top 2 eigenvalues are >= `fraction` of the total.
        It's ok to change this; methods use the current value.

Methods:
    The methods of class PCA transform vectors or arrays of e.g.
    20 variables, 2 principal components and 1000 observations,
    using partial matrices U' d' Vt', parts of the full U d Vt:
    A ~ U' . d' . Vt' where e.g.
        U' is 1000 x 2
        d' is diag([ d0, d1 ]), the 2 largest singular values
        Vt' is 2 x 20.  Dropping the primes,

    d . Vt      2 principal vars = p.vars_pc( 20 vars )
    U           1000 obs = p.pc_obs( 2 principal vars )
    U . d . Vt  1000 obs, p.obs( 20 vars ) = pc_obs( vars_pc( vars ))
        fast approximate A . vars, using the `npc` principal components

    Ut              2 pcs = p.obs_pc( 1000 obs )
    V . dinv        20 vars = p.pc_vars( 2 principal vars )
    V . dinv . Ut   20 vars, p.vars( 1000 obs ) = pc_vars( obs_pc( obs )),
        fast approximate Ainverse . obs: vars that give ~ those obs.


Notes:
    PCA does not center or scale A; you usually want to first
        A -= A.mean(A, axis=0)
        A /= A.std(A, axis=0)
    with the little class Center or the like, below.

See also:
    http://en.wikipedia.org/wiki/Principal_component_analysis
    http://en.wikipedia.org/wiki/Singular_value_decomposition
    Press et al., Numerical Recipes (2 or 3 ed), SVD
    PCA micro-tutorial
    iris-pca .py .png

"""

from __future__ import division
import numpy as np
dot = np.dot
    # import bz.numpyutil as nu
    # dot = nu.pdot

__version__ = "2010-04-14 apr"
__author_email__ = "denis-bz-py at t-online dot de"

#...............................................................................
class PCA:
    def __init__( self, A, fraction=0.90 ):
        assert 0 <= fraction <= 1
            # A = U . diag(d) . Vt, O( m n^2 ), lapack_lite --
        self.U, self.d, self.Vt = np.linalg.svd( A, full_matrices=False )
        assert np.all( self.d[:-1] >= self.d[1:] )  # sorted
        self.eigen = self.d**2
        self.sumvariance = np.cumsum(self.eigen)
        self.sumvariance /= self.sumvariance[-1]
        self.npc = np.searchsorted( self.sumvariance, fraction ) + 1
        self.dinv = np.array([ 1/d if d > self.d[0] * 1e-6  else 0
                                for d in self.d ])

    def pc( self ):
        """ e.g. 1000 x 2 U[:, :npc] * d[:npc], to plot etc. """
        n = self.npc
        return self.U[:, :n] * self.d[:n]

    # These 1-line methods may not be worth the bother;
    # then use U d Vt directly --

    def vars_pc( self, x ):
        n = self.npc
        return self.d[:n] * dot( self.Vt[:n], x.T ).T  # 20 vars -> 2 principal

    def pc_vars( self, p ):
        n = self.npc
        return dot( self.Vt[:n].T, (self.dinv[:n] * p).T ) .T  # 2 PC -> 20 vars

    def pc_obs( self, p ):
        n = self.npc
        return dot( self.U[:, :n], p.T )  # 2 principal -> 1000 obs

    def obs_pc( self, obs ):
        n = self.npc
        return dot( self.U[:, :n].T, obs ) .T  # 1000 obs -> 2 principal

    def obs( self, x ):
        return self.pc_obs( self.vars_pc(x) )  # 20 vars -> 2 principal -> 1000 obs

    def vars( self, obs ):
        return self.pc_vars( self.obs_pc(obs) )  # 1000 obs -> 2 principal -> 20 vars


class Center:
    """ A -= A.mean() /= A.std(), inplace -- use A.copy() if need be
        uncenter(x) == original A . x
    """
        # mttiw
    def __init__( self, A, axis=0, scale=True, verbose=1 ):
        self.mean = A.mean(axis=axis)
        if verbose:
            print "Center -= A.mean:", self.mean
        A -= self.mean
        if scale:
            std = A.std(axis=axis)
            self.std = np.where( std, std, 1. )
            if verbose:
                print "Center /= A.std:", self.std
            A /= self.std
        else:
            self.std = np.ones( A.shape[-1] )
        self.A = A

    def uncenter( self, x ):
        return np.dot( self.A, x * self.std ) + np.dot( x, self.mean )


#...............................................................................
if __name__ == "__main__":
    import sys

    csv = "iris4.csv"  # wikipedia Iris_flower_data_set
        # 5.1,3.5,1.4,0.2  # ,Iris-setosa ...
    N = 1000
    K = 20
    fraction = .90
    seed = 1
    exec "\n".join( sys.argv[1:] )  # N= ...
    np.random.seed(seed)
    np.set_printoptions( 1, threshold=100, suppress=True )  # .1f
    try:
        A = np.genfromtxt( csv, delimiter="," )
        N, K = A.shape
    except IOError:
        A = np.random.normal( size=(N, K) )  # gen correlated ?

    print "csv: %s  N: %d  K: %d  fraction: %.2g" % (csv, N, K, fraction)
    Center(A)
    print "A:", A

    print "PCA ..." ,
    p = PCA( A, fraction=fraction )
    print "npc:", p.npc
    print "% variance:", p.sumvariance * 100

    print "Vt[0], weights that give PC 0:", p.Vt[0]
    print "A . Vt[0]:", dot( A, p.Vt[0] )
    print "pc:", p.pc()

    print "\nobs <-> pc <-> x: with fraction=1, diffs should be ~ 0"
    x = np.ones(K)
    # x = np.ones(( 3, K ))
    print "x:", x
    pc = p.vars_pc(x)  # d' Vt' x
    print "vars_pc(x):", pc
    print "back to ~ x:", p.pc_vars(pc)

    Ax = dot( A, x.T )
    pcx = p.obs(x)  # U' d' Vt' x
    print "Ax:", Ax
    print "A'x:", pcx
    print "max |Ax - A'x|: %.2g" % np.linalg.norm( Ax - pcx, np.inf )

    b = Ax  # ~ back to original x, Ainv A x
    back = p.vars(b)
    print "~ back again:", back
    print "max |back - x|: %.2g" % np.linalg.norm( back - x, np.inf )

# end pca.py

[figure: iris-pca plot output]


回答 2

用 numpy.linalg.svd 做 PCA 非常容易。这是一个简单的演示:

import numpy as np
import matplotlib.pyplot as plt
from scipy.misc import lena

# the underlying signal is a sinusoidally modulated image
img = lena()
t = np.arange(100)
time = np.sin(0.1*t)
real = time[:,np.newaxis,np.newaxis] * img[np.newaxis,...]

# we add some noise
noisy = real + np.random.randn(*real.shape)*255

# (observations, features) matrix
M = noisy.reshape(noisy.shape[0],-1)

# singular value decomposition factorises your data matrix such that:
# 
#   M = U*S*V.T     (where '*' is matrix multiplication)
# 
# * U and V are the singular matrices, containing orthogonal vectors of
#   unit length in their rows and columns respectively.
#
# * S is a diagonal matrix containing the singular values of M - these 
#   values squared divided by the number of observations will give the 
#   variance explained by each PC.
#
# * if M is considered to be an (observations, features) matrix, the PCs
#   themselves would correspond to the rows of S^(1/2)*V.T. if M is 
#   (features, observations) then the PCs would be the columns of
#   U*S^(1/2).
#
# * since U and V both contain orthonormal vectors, U*V.T is equivalent 
#   to a whitened version of M.

U, s, Vt = np.linalg.svd(M, full_matrices=False)
V = Vt.T

# PCs are already sorted by descending order 
# of the singular values (i.e. by the
# proportion of total variance they explain)

# if we use all of the PCs we can reconstruct the noisy signal perfectly
S = np.diag(s)
Mhat = np.dot(U, np.dot(S, V.T))
print "Using all PCs, MSE = %.6G" %(np.mean((M - Mhat)**2))

# if we use only the first 20 PCs the reconstruction is less accurate
Mhat2 = np.dot(U[:, :20], np.dot(S[:20, :20], V[:,:20].T))
print "Using first 20 PCs, MSE = %.6G" %(np.mean((M - Mhat2)**2))

fig, [ax1, ax2, ax3] = plt.subplots(1, 3)
ax1.imshow(img)
ax1.set_title('true image')
ax2.imshow(noisy.mean(0))
ax2.set_title('mean of noisy images')
ax3.imshow((s[0]**(1./2) * V[:,0]).reshape(img.shape))
ax3.set_title('first spatial PC')
plt.show()

PCA using numpy.linalg.svd is super easy. Here’s a simple demo:

import numpy as np
import matplotlib.pyplot as plt
from scipy.misc import lena

# the underlying signal is a sinusoidally modulated image
img = lena()
t = np.arange(100)
time = np.sin(0.1*t)
real = time[:,np.newaxis,np.newaxis] * img[np.newaxis,...]

# we add some noise
noisy = real + np.random.randn(*real.shape)*255

# (observations, features) matrix
M = noisy.reshape(noisy.shape[0],-1)

# singular value decomposition factorises your data matrix such that:
# 
#   M = U*S*V.T     (where '*' is matrix multiplication)
# 
# * U and V are the singular matrices, containing orthogonal vectors of
#   unit length in their rows and columns respectively.
#
# * S is a diagonal matrix containing the singular values of M - these 
#   values squared divided by the number of observations will give the 
#   variance explained by each PC.
#
# * if M is considered to be an (observations, features) matrix, the PCs
#   themselves would correspond to the rows of S^(1/2)*V.T. if M is 
#   (features, observations) then the PCs would be the columns of
#   U*S^(1/2).
#
# * since U and V both contain orthonormal vectors, U*V.T is equivalent 
#   to a whitened version of M.

U, s, Vt = np.linalg.svd(M, full_matrices=False)
V = Vt.T

# PCs are already sorted by descending order 
# of the singular values (i.e. by the
# proportion of total variance they explain)

# if we use all of the PCs we can reconstruct the noisy signal perfectly
S = np.diag(s)
Mhat = np.dot(U, np.dot(S, V.T))
print "Using all PCs, MSE = %.6G" %(np.mean((M - Mhat)**2))

# if we use only the first 20 PCs the reconstruction is less accurate
Mhat2 = np.dot(U[:, :20], np.dot(S[:20, :20], V[:,:20].T))
print "Using first 20 PCs, MSE = %.6G" %(np.mean((M - Mhat2)**2))

fig, [ax1, ax2, ax3] = plt.subplots(1, 3)
ax1.imshow(img)
ax1.set_title('true image')
ax2.imshow(noisy.mean(0))
ax2.set_title('mean of noisy images')
ax3.imshow((s[0]**(1./2) * V[:,0]).reshape(img.shape))
ax3.set_title('first spatial PC')
plt.show()

回答 3

您可以使用sklearn:

import sklearn.decomposition as deco
import numpy as np

x = (x - np.mean(x, 0)) / np.std(x, 0) # You need to normalize your data first
pca = deco.PCA(n_components) # n_components is the components number after reduction
x_r = pca.fit(x).transform(x)
print ('explained variance (first %d components): %.2f'%(n_components, sum(pca.explained_variance_ratio_)))

You can use sklearn:

import sklearn.decomposition as deco
import numpy as np

x = (x - np.mean(x, 0)) / np.std(x, 0) # You need to normalize your data first
pca = deco.PCA(n_components) # n_components is the components number after reduction
x_r = pca.fit(x).transform(x)
print ('explained variance (first %d components): %.2f'%(n_components, sum(pca.explained_variance_ratio_)))
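
For a self-contained version of the same recipe (the random data and the choice of n_components here are placeholders):

import numpy as np
import sklearn.decomposition as deco

np.random.seed(0)
x = np.random.randn(1000, 460)            # 1000 observations, 460 dimensions

n_components = 10
x = (x - np.mean(x, 0)) / np.std(x, 0)    # normalize first, as above
pca = deco.PCA(n_components)
x_r = pca.fit(x).transform(x)

print(x_r.shape)                          # (1000, 10)
print(sum(pca.explained_variance_ratio_)) # fraction of variance kept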


回答 5

SVD 在 460 维上应该可以正常工作。在我的 Atom 上网本上大约需要 7 秒。eig() 方法花费更多时间(理应如此,它使用更多的浮点运算),而且几乎总是精度更低。

如果您的样本少于 460 个,那么您要做的是对角化散布矩阵 (x - datamean)^T (x - datamean)(假设数据点是列),然后左乘 (x - datamean)。在维数多于数据点的情况下,这可能会更快。

SVD should work fine with 460 dimensions. It takes about 7 seconds on my Atom netbook. The eig() method takes more time (as it should, it uses more floating point operations) and will almost always be less accurate.

If you have less than 460 examples then what you want to do is diagonalize the scatter matrix (x – datamean)^T (x – datamean), assuming your data points are columns, and then left-multiply by (x – datamean). That might be faster in the case where you have more dimensions than data.
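
A sketch of that trick in code (the shapes are illustrative): with n samples as columns and d > n, you eigendecompose the small n-by-n matrix and map its eigenvectors back up by left-multiplying with the centered data:

import numpy as np

rng = np.random.RandomState(0)
X = rng.randn(460, 100)                  # d=460 dimensions, n=100 samples as columns
Xc = X - X.mean(axis=1, keepdims=True)   # subtract the data mean

G = Xc.T.dot(Xc)                         # n x n scatter matrix instead of d x d
evals, V = np.linalg.eigh(G)             # symmetric, so eigh applies
order = np.argsort(evals)[::-1]          # largest eigenvalues first
evals, V = evals[order], V[:, order]

k = 10                                   # keep the top k components
U = Xc.dot(V[:, :k])                     # left-multiply: eigenvectors of Xc.dot(Xc.T)
U /= np.linalg.norm(U, axis=0)           # normalize the columns to unit length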


回答 6

您可以很容易地用 scipy.linalg “自己动手”实现(假设数据集 data 已预先中心化):

import scipy.linalg

covmat = data.dot(data.T)
evs, evmat = scipy.linalg.eig(covmat)

然后evs是您的特征值,evmat就是您的投影矩阵。

如果要保留 d 个维度,请使用前 d 个特征值和前 d 个特征向量。

既然 scipy.linalg 提供了分解,numpy 提供了矩阵乘法,您还需要什么呢?

You can quite easily “roll” your own using scipy.linalg (assuming a pre-centered dataset data):

import scipy.linalg

covmat = data.dot(data.T)
evs, evmat = scipy.linalg.eig(covmat)

Then evs are your eigenvalues, and evmat is your projection matrix.

If you want to keep d dimensions, use the first d eigenvalues and first d eigenvectors.

Given that scipy.linalg has the decomposition and numpy the matrix multiplications, what else do you need?
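
One detail worth adding: scipy.linalg.eig does not return the eigenvalues sorted, so "the first d" requires ordering them yourself. A sketch of the selection and projection step, using eigh instead since covmat is symmetric (eigh returns real eigenvalues in ascending order):

import numpy as np
import scipy.linalg

rng = np.random.RandomState(1)
data = rng.randn(20, 500)                 # observations as columns
data -= data.mean(axis=1, keepdims=True)  # pre-center, as assumed above

covmat = data.dot(data.T)
evs, evmat = scipy.linalg.eigh(covmat)    # real eigenvalues, ascending

d = 2
proj = evmat[:, ::-1][:, :d]              # top-d eigenvectors, largest first
reduced = proj.T.dot(data)                # shape (d, n_samples)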


回答 7

我刚读完《机器学习:算法视角》(Machine Learning: An Algorithmic Perspective)一书。书中所有代码示例都是用 Python 编写的(而且几乎都用到了 Numpy)。第 10.2 章“主成分分析”的代码片段也许值得一读,它使用 numpy.linalg.eig。
顺便说一句,我认为 SVD 可以很好地处理 460×460 的维度。我曾在一台非常旧的 PC(Pentium III 733 MHz)上用 numpy/scipy.linalg.svd 计算过 6500×6500 的 SVD。老实说,脚本需要大量内存(约 1.x GB)和大量时间(约 30 分钟)才能得到 SVD 结果。但我认为,除非您需要执行海量次数的 SVD,否则在现代 PC 上 460×460 不会是什么大问题。

I just finished reading the book Machine Learning: An Algorithmic Perspective. All code examples in the book were written in Python (and almost all with Numpy). The code snippet for chapter 10.2, Principal Components Analysis, may be worth a read. It uses numpy.linalg.eig.
By the way, I think SVD can handle 460*460 dimensions very well. I have calculated a 6500*6500 SVD with numpy/scipy.linalg.svd on a very old PC: a Pentium III at 733 MHz. To be honest, the script needs a lot of memory (about 1.x GB) and a lot of time (about 30 minutes) to get the SVD result. But I think 460*460 on a modern PC will not be a big problem unless you need to do the SVD a huge number of times.


回答 8

您不需要完整的奇异值分解(SVD),因为它会计算所有特征值和特征向量,对大型矩阵来说代价可能高得难以接受。scipy 及其稀疏模块提供了同时适用于稀疏和稠密矩阵的通用线性代数函数,其中包括 eig* 系列函数:

http://docs.scipy.org/doc/scipy/reference/sparse.linalg.html#matrix-factorizations

Scikit-learn提供了Python PCA实现,目前仅支持密集矩阵。

计时:

In [1]: A = np.random.randn(1000, 1000)

In [2]: %timeit scipy.sparse.linalg.eigsh(A)
1 loops, best of 3: 802 ms per loop

In [3]: %timeit np.linalg.svd(A)
1 loops, best of 3: 5.91 s per loop

You do not need full Singular Value Decomposition (SVD), as it computes all eigenvalues and eigenvectors and can be prohibitive for large matrices. scipy and its sparse module provide generic linear algebra functions working on both sparse and dense matrices, among which there is the eig* family of functions:

http://docs.scipy.org/doc/scipy/reference/sparse.linalg.html#matrix-factorizations

Scikit-learn provides a Python PCA implementation which only support dense matrices for now.

Timings :

In [1]: A = np.random.randn(1000, 1000)

In [2]: %timeit scipy.sparse.linalg.eigsh(A)
1 loops, best of 3: 802 ms per loop

In [3]: %timeit np.linalg.svd(A)
1 loops, best of 3: 5.91 s per loop
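
To actually exploit this, ask eigsh for only the top k eigenpairs of a symmetric matrix rather than the full decomposition; a minimal sketch on a covariance matrix:

import numpy as np
import scipy.sparse.linalg

rng = np.random.RandomState(0)
X = rng.randn(1000, 460)
C = np.cov(X, rowvar=False)               # 460 x 460 symmetric covariance

# compute only the k largest-magnitude eigenpairs
evals, evecs = scipy.sparse.linalg.eigsh(C, k=6, which='LM')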

回答 9

这是另一个用 numpy、scipy 和 C 扩展实现的 Python PCA 模块。该模块使用 SVD 或用 C 实现的 NIPALS(非线性迭代偏最小二乘)算法来执行 PCA。

Here is another implementation of a PCA module for python using numpy, scipy and C-extensions. The module carries out PCA using either a SVD or the NIPALS (Nonlinear Iterative Partial Least Squares) algorithm which is implemented in C.


回答 10

如果您处理的是 3D 向量,可以使用工具库 vg 来简洁地应用 SVD。它是 numpy 之上的一个轻量层。

import numpy as np
import vg

vg.principal_components(data)

如果只需要第一个主成分,则还有一个方便的别名:

vg.major_axis(data)

我在上一家创业公司时创建了这个库,当时的动机正是这类用途:在 NumPy 中写起来冗长或晦涩的简单想法。

If you’re working with 3D vectors, you can apply SVD concisely using the toolbelt vg. It’s a light layer on top of numpy.

import numpy as np
import vg

vg.principal_components(data)

There’s also a convenient alias if you only want the first principal component:

vg.major_axis(data)

I created the library at my last startup, where it was motivated by uses like this: simple ideas which are verbose or opaque in NumPy.


索引python中除*某一项*之外的所有元素

问题:索引python中除*某一项*之外的所有元素

有没有一种简单的方法来索引列表(或数组,或其他任何东西)中特定索引之外的所有元素?例如,

  • mylist[3] 会返回位置 3 上的那个元素

  • mylist[~3] 会返回除位置 3 之外的整个列表

Is there a simple way to index all elements of a list (or array, or whatever) except for a particular index? E.g.,

  • mylist[3] will return the item in position 3

  • mylist[~3] will return the whole list except for 3


回答 0

对于列表,您可以使用列表推导式。例如,要让 b 成为 a 的一个去掉第 3 个元素的副本:

a = range(10)[::-1]                       # [9, 8, 7, 6, 5, 4, 3, 2, 1, 0]
b = [x for i,x in enumerate(a) if i!=3]   # [9, 8, 7, 5, 4, 3, 2, 1, 0]

这是非常通用的方法,可用于所有可迭代对象,包括 numpy 数组。如果把 [] 换成 (),b 将是一个迭代器,而不是列表。

或者,您可以用 pop 就地完成此操作:

a = range(10)[::-1]     # a = [9, 8, 7, 6, 5, 4, 3, 2, 1, 0]
a.pop(3)                # a = [9, 8, 7, 5, 4, 3, 2, 1, 0]

numpy中,您可以使用布尔索引来做到这一点:

a = np.arange(9, -1, -1)     # a = array([9, 8, 7, 6, 5, 4, 3, 2, 1, 0])
b = a[np.arange(len(a))!=3]  # b = array([9, 8, 7, 5, 4, 3, 2, 1, 0])

通常,这比上面列出的列表推导式要快得多。

For a list, you could use a list comp. For example, to make b a copy of a without the 3rd element:

a = range(10)[::-1]                       # [9, 8, 7, 6, 5, 4, 3, 2, 1, 0]
b = [x for i,x in enumerate(a) if i!=3]   # [9, 8, 7, 5, 4, 3, 2, 1, 0]

This is very general, and can be used with all iterables, including numpy arrays. If you replace [] with (), b will be an iterator instead of a list.

Or you could do this in-place with pop:

a = range(10)[::-1]     # a = [9, 8, 7, 6, 5, 4, 3, 2, 1, 0]
a.pop(3)                # a = [9, 8, 7, 5, 4, 3, 2, 1, 0]

In numpy you could do this with boolean indexing:

a = np.arange(9, -1, -1)     # a = array([9, 8, 7, 6, 5, 4, 3, 2, 1, 0])
b = a[np.arange(len(a))!=3]  # b = array([9, 8, 7, 5, 4, 3, 2, 1, 0])

which will, in general, be much faster than the list comprehension listed above.
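
To check the speed claim on your own machine, a quick timeit sketch (array size and repeat count are arbitrary choices):

import timeit

setup = "import numpy as np; a = np.arange(10000)"
# list comprehension over the array's elements
print(timeit.timeit("[x for i, x in enumerate(a) if i != 3]", setup=setup, number=100))
# boolean indexing, rebuilding the index mask on every call
print(timeit.timeit("a[np.arange(len(a)) != 3]", setup=setup, number=100))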


回答 1

>>> l = range(1,10)
>>> l
[1, 2, 3, 4, 5, 6, 7, 8, 9]
>>> l[:2] 
[1, 2]
>>> l[3:]
[4, 5, 6, 7, 8, 9]
>>> l[:2] + l[3:]
[1, 2, 4, 5, 6, 7, 8, 9]
>>> 

也可以看看

解释Python的切片符号

>>> l = range(1,10)
>>> l
[1, 2, 3, 4, 5, 6, 7, 8, 9]
>>> l[:2] 
[1, 2]
>>> l[3:]
[4, 5, 6, 7, 8, 9]
>>> l[:2] + l[3:]
[1, 2, 4, 5, 6, 7, 8, 9]
>>> 

See also

Explain Python’s slice notation
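
The slice-and-concatenate idea above works for any sequence that supports +, not just lists. A small helper, purely for illustration (the name all_but is hypothetical):

def all_but(seq, i):
    """Return a copy of seq without the element at non-negative index i."""
    return seq[:i] + seq[i+1:]

all_but([1, 2, 3, 4], 2)   # [1, 2, 4]
all_but("hello", 0)        # 'ello'
all_but((9, 8, 7), 1)      # (9, 7)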


回答 2

我发现的最简单的方法是:

mylist[:x] + mylist[x+1:]

这将产生不含索引x处元素的mylist。

The simplest way I found was:

mylist[:x] + mylist[x+1:]

that will produce your mylist without the element at index x.

Example

mylist = [0, 1, 2, 3, 4, 5]
x = 3
mylist[:x] + mylist[x+1:]

Result produced

mylist = [0, 1, 2, 4, 5]

回答 3

如果您使用的是numpy,我能想到的最接近的方法是使用掩码:

>>> import numpy as np
>>> arr = np.arange(1,10)
>>> mask = np.ones(arr.shape,dtype=bool)
>>> mask[5]=0
>>> arr[mask]
array([1, 2, 3, 4, 5, 7, 8, 9])

不使用numpy,用itertools也可以达到类似的效果:

>>> from itertools import compress
>>> arr = range(1,10)
>>> mask = [1]*len(arr)
>>> mask[5]=0
>>> list(compress(arr,mask))
[1, 2, 3, 4, 5, 7, 8, 9]

If you are using numpy, the closest I can think of is using a mask:

>>> import numpy as np
>>> arr = np.arange(1,10)
>>> mask = np.ones(arr.shape,dtype=bool)
>>> mask[5]=0
>>> arr[mask]
array([1, 2, 3, 4, 5, 7, 8, 9])

Something similar can be achieved using itertools without numpy

>>> from itertools import compress
>>> arr = range(1,10)
>>> mask = [1]*len(arr)
>>> mask[5]=0
>>> list(compress(arr,mask))
[1, 2, 3, 4, 5, 7, 8, 9]
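
The same mask generalizes to excluding several indices at once; a short sketch:

import numpy as np

arr = np.arange(1, 10)
mask = np.ones(arr.shape, dtype=bool)
mask[[0, 5, 7]] = False   # exclude indices 0, 5 and 7 in one step
arr[mask]                 # array([2, 3, 4, 5, 7, 9])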

回答 4

使用np.delete!它实际上不会就地删除任何内容

例:

import numpy as np
a = np.array([[1,4],[5,7],[3,1]])                                       

# a: array([[1, 4],
#           [5, 7],
#           [3, 1]])

ind = np.array([0,1])                                                   

# ind: array([0, 1])

# a[ind]: array([[1, 4],
#                [5, 7]])

all_except_index = np.delete(a, ind, axis=0)                                              
# all_except_index: array([[3, 1]])

# a: (still the same): array([[1, 4],
#                             [5, 7],
#                             [3, 1]])

Use np.delete! It does not actually delete anything in place.

Example:

import numpy as np
a = np.array([[1,4],[5,7],[3,1]])                                       

# a: array([[1, 4],
#           [5, 7],
#           [3, 1]])

ind = np.array([0,1])                                                   

# ind: array([0, 1])

# a[ind]: array([[1, 4],
#                [5, 7]])

all_except_index = np.delete(a, ind, axis=0)                                              
# all_except_index: array([[3, 1]])

# a: (still the same): array([[1, 4],
#                             [5, 7],
#                             [3, 1]])
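
One caveat worth noting: without the axis argument, np.delete operates on the flattened array:

import numpy as np

a = np.array([[1, 4], [5, 7], [3, 1]])
np.delete(a, [0, 1])          # array([5, 7, 3, 1]) -- flattened first
np.delete(a, [0, 1], axis=0)  # array([[3, 1]])     -- whole rows removed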

回答 5

我将提供一种函数式(不可变)的方法。

  1. 做到这一点的标准和简单方法是使用切片:

    index_to_remove = 3
    data = [*range(5)]
    new_data = data[:index_to_remove] + data[index_to_remove + 1:]
    
    print(f"data: {data}, new_data: {new_data}")

    输出:

    data: [0, 1, 2, 3, 4], new_data: [0, 1, 2, 4]
  2. 使用列表推导式:

    index_to_remove = 3
    data = [*range(5)]
    new_data = [v for i, v in enumerate(data) if i != index_to_remove]
    
    print(f"data: {data}, new_data: {new_data}") 

    输出:

    data: [0, 1, 2, 3, 4], new_data: [0, 1, 2, 4]
  3. 使用filter函数(注意:通过enumerate按索引过滤,而非按值过滤):

    index_to_remove = 3
    data = [*range(5)]
    # 通过enumerate将每个值与其索引配对,再按索引过滤
    new_data = [v for _, v in filter(lambda iv: iv[0] != index_to_remove, enumerate(data))]

    print(f"data: {data}, new_data: {new_data}")

    输出:

    data: [0, 1, 2, 3, 4], new_data: [0, 1, 2, 4]
  4. 使用掩码。掩码功能由标准库中的itertools.compress函数提供:

    from itertools import compress
    
    index_to_remove = 3
    data = [*range(5)]
    mask = [1] * len(data)
    mask[index_to_remove] = 0
    new_data = [*compress(data, mask)]
    
    print(f"data: {data}, mask: {mask}, new_data: {new_data}")

    输出:

    data: [0, 1, 2, 3, 4], mask: [1, 1, 1, 0, 1], new_data: [0, 1, 2, 4]
  5. 使用Python标准库中的itertools.filterfalse函数:

    from itertools import filterfalse
    
    index_to_remove = 3
    data = [*range(5)]
    # 同样通过enumerate按索引过滤,而非按值过滤
    new_data = [v for _, v in filterfalse(lambda iv: iv[0] == index_to_remove, enumerate(data))]
    
    print(f"data: {data}, new_data: {new_data}")

    输出:

    data: [0, 1, 2, 3, 4], new_data: [0, 1, 2, 4]

I’m going to provide a functional (immutable) way of doing it.

  1. The standard and easy way of doing it is to use slicing:

    index_to_remove = 3
    data = [*range(5)]
    new_data = data[:index_to_remove] + data[index_to_remove + 1:]
    
    print(f"data: {data}, new_data: {new_data}")
    

    Output:

    data: [0, 1, 2, 3, 4], new_data: [0, 1, 2, 4]
    
  2. Use a list comprehension:

    index_to_remove = 3
    data = [*range(5)]
    new_data = [v for i, v in enumerate(data) if i != index_to_remove]
    
    print(f"data: {data}, new_data: {new_data}") 
    

    Output:

    data: [0, 1, 2, 3, 4], new_data: [0, 1, 2, 4]
    
  3. Use the filter function (note: it filters by index via enumerate, not by value):

    index_to_remove = 3
    data = [*range(5)]
    # pair each value with its index, then drop the pair whose index matches
    new_data = [v for _, v in filter(lambda iv: iv[0] != index_to_remove, enumerate(data))]

    print(f"data: {data}, new_data: {new_data}")

    Output:

    data: [0, 1, 2, 3, 4], new_data: [0, 1, 2, 4]
    
  4. Using masking. Masking is provided by the itertools.compress function in the standard library:

    from itertools import compress
    
    index_to_remove = 3
    data = [*range(5)]
    mask = [1] * len(data)
    mask[index_to_remove] = 0
    new_data = [*compress(data, mask)]
    
    print(f"data: {data}, mask: {mask}, new_data: {new_data}")
    

    Output:

    data: [0, 1, 2, 3, 4], mask: [1, 1, 1, 0, 1], new_data: [0, 1, 2, 4]
    
  5. Use the itertools.filterfalse function from the Python standard library:

    from itertools import filterfalse
    
    index_to_remove = 3
    data = [*range(5)]
    # again, filter by index via enumerate rather than by value
    new_data = [v for _, v in filterfalse(lambda iv: iv[0] == index_to_remove, enumerate(data))]
    
    print(f"data: {data}, new_data: {new_data}")
    

    Output:

    data: [0, 1, 2, 3, 4], new_data: [0, 1, 2, 4]
    

回答 6

如果您事先不知道索引,这里有一个可用的函数(注意它会就地修改列表):

def reverse_index(l, index):
    # 就地弹出index处的元素;index越界时返回False
    try:
        l.pop(index)
        return l
    except IndexError:
        return False

If you don't know the index beforehand, here is a function that will work (note that it mutates the list in place):

def reverse_index(l, index):
    # pops the element at index in place; returns False if index is out of range
    try:
        l.pop(index)
        return l
    except IndexError:
        return False
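
If mutation is undesirable, a non-mutating variant with the same contract might look like this (a sketch; the name without_index is hypothetical):

def without_index(l, index):
    # return a copy of l without the element at index; False if out of range
    if 0 <= index < len(l):
        return l[:index] + l[index+1:]
    return False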

回答 7

请注意,如果变量是列表的列表,则某些方法会失败。例如:

v1 = [[range(3)] for x in range(4)]
v2 = v1[:3]+v1[4:] # this fails
v2

对于一般情况,请使用:

removed_index = 1
v1 = [[range(3)] for x in range(4)]
v2 = [x for i,x in enumerate(v1) if i != removed_index]
v2

Note that if the variable is a list of lists, some approaches would fail. For example:

v1 = [[range(3)] for x in range(4)]
v2 = v1[:3]+v1[4:] # this fails
v2

For the general case, use

removed_index = 1
v1 = [[range(3)] for x in range(4)]
v2 = [x for i,x in enumerate(v1) if i != removed_index]
v2

回答 8

如果要去掉最后一个或第一个元素,请执行以下操作:

mylist = ["This", "is", "a", "list"]   # 避免用"list"作变量名,以免遮蔽内置类型
listnolast = mylist[:-1]
listnofirst = mylist[1:]

如果将1改为2,则会删除前2个元素,而不是第二个元素。希望这能有所帮助!

If you want to cut out the last or the first element, do this:

mylist = ["This", "is", "a", "list"]   # avoid naming it "list", which shadows the built-in
listnolast = mylist[:-1]
listnofirst = mylist[1:]

If you change 1 to 2, the first 2 elements will be removed, not the second. Hope this still helps!
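
For example, increasing the offsets trims more from either end:

mylist = ["This", "is", "a", "list"]
mylist[2:]    # ['a', 'list'] -- drop the first two
mylist[:-2]   # ['This', 'is'] -- drop the last two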