Tag archive: numpy

ImportError when importing from sklearn: cannot import name check_build

Question: ImportError when importing from sklearn: cannot import name check_build

I am getting the following error while trying to import from sklearn:

>>> from sklearn import svm

Traceback (most recent call last):
  File "<pyshell#17>", line 1, in <module>
   from sklearn import svm
  File "C:\Python27\lib\site-packages\sklearn\__init__.py", line 16, in <module>
   from . import check_build
ImportError: cannot import name check_build

I am using Python 2.7, scipy-0.12.0b1 superpack, numpy-1.6.0 superpack, and scikit-learn-0.11 on a Windows 7 machine.

I have checked several answers for this issue but none of them gives a way out of this error.


Answer 0

Worked for me after installing scipy.


Answer 1

>>> from sklearn import preprocessing, metrics, cross_validation

Traceback (most recent call last):
  File "<pyshell#6>", line 1, in <module>
    from sklearn import preprocessing, metrics, cross_validation
  File "D:\Python27\lib\site-packages\sklearn\__init__.py", line 31, in <module>
    from . import __check_build
ImportError: cannot import name __check_build
>>> ================================ RESTART ================================
>>> from sklearn import preprocessing, metrics, cross_validation
>>> 

So, simply try to restart the shell!


Answer 2

My solution for Python 3.6.5 64-bit Windows 10:

  1. pip uninstall sklearn
  2. pip uninstall scikit-learn
  3. pip install sklearn

No need to restart the command line, but you can if you want. It took me one day to fix this bug. Hope this helps.


Answer 3

After installing numpy and scipy, sklearn still had the error.

Solution:

Setting Up System Path Variable for Python & the PYTHONPATH Environment Variable

System Variables: add C:\Python34 to Path
User Variables: add new: (name) PYTHONPATH (value) C:\Python34\Lib\site-packages;
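A quick way to verify that the interpreter actually picks those directories up (a minimal check; the exact entries will vary by system):

import sys

# site-packages should appear somewhere in this list
print(sys.path)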


Answer 4

Usually when I get these kinds of errors, opening the __init__.py file and poking around helps. Go to the directory C:\Python27\lib\site-packages\sklearn and ensure that there’s a sub-directory called __check_build as a first step. On my machine (with a working sklearn installation, Mac OSX, Python 2.7.3) I have __init__.py, setup.py, their associated .pyc files, and a binary _check_build.so.

Poking around the __init__.py in that directory, the next step I’d take is to go to sklearn/__init__.py and comment out the import statement; the check_build stuff just checks that things were compiled correctly, and it doesn’t appear to do anything but call a precompiled binary. This is, of course, at your own risk, and is (to be sure) a workaround. If your build failed, you’ll likely soon run into other, bigger problems.
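As a quick sanity check before commenting anything out, a short snippet along these lines can confirm whether the __check_build subpackage is present (the path is taken from the traceback above; adjust it for your system):

import os

# Path from the traceback above; adjust for your install
pkg_dir = r"C:\Python27\lib\site-packages\sklearn"
print(os.path.isdir(os.path.join(pkg_dir, "__check_build")))
print(sorted(os.listdir(pkg_dir))[:10])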


Answer 5

I had the same issue on Windows. Solved it by installing Numpy+MKL from http://www.lfd.uci.edu/~gohlke/pythonlibs/#numpy (it is recommended there to install numpy+mkl before other packages that depend on it), as suggested by this answer.


Answer 6

I had problems importing sklearn after installing a new 64-bit version of Python 3.4 from python.org.

It turned out that it was the scipy module that was broken, which also failed when I tried to “import scipy”.

The solution was to uninstall scipy and reinstall it with pip3:

C:\> pip uninstall scipy

[lots of reporting messages deleted]

Proceed (y/n)? y
  Successfully uninstalled scipy-1.0.0

C:\Users\>pip3 install scipy

Collecting scipy
  Downloading scipy-1.0.0-cp36-none-win_amd64.whl (30.8MB)
    100% |████████████████████████████████| 30.8MB 33kB/s
Requirement already satisfied: numpy>=1.8.2 in c:\users\johnmccurdy\appdata\loca
l\programs\python\python36\lib\site-packages (from scipy)
Installing collected packages: scipy
Successfully installed scipy-1.0.0

C:\Users>python
Python 3.6.4 (v3.6.4:d48eceb, Dec 19 2017, 06:54:40) [MSC v.1900 64 bit (AMD64)]
 on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import scipy
>>>
>>> import sklearn
>>>

Answer 7

If you use Anaconda 2.7 64 bit, try

conda upgrade scikit-learn

and restart the Python shell; that worked for me.

Second edit, from when I faced the same problem again and solved it:

conda upgrade scikit-learn

also worked for me.


Answer 8

None of the other answers worked for me. After some tinkering, I uninstalled sklearn:

pip uninstall sklearn

Then I removed the sklearn folder from here (adjust the path to your system and Python version):

C:\Users\%USERNAME%\AppData\Roaming\Python\Python36\site-packages

And then installed it from a wheel from this site: link

The error was probably there because of a version conflict with an sklearn installed somewhere else.


Answer 9

For me, I was upgrading existing code to a new setup by installing Anaconda fresh with the latest Python version (3.7). For this, change

from sklearn import cross_validation, 
from sklearn.grid_search import GridSearchCV

to

from sklearn.model_selection import GridSearchCV,cross_validate

Answer 10

No need to uninstall and then re-install sklearn.

Try this:

from sklearn.model_selection import train_test_split

Answer 11

I had the same problem; reinstalling Anaconda solved the issue for me.


Answer 12

In Windows:

I tried to delete sklearn from the shell (pip uninstall sklearn) and re-install it, but it didn’t work.

The solution:

1- Open the cmd shell.
2- cd c:\pythonVERSION\scripts
3- pip uninstall sklearn
4- Open in Explorer: C:\pythonVERSION\Lib\site-packages
5- Look for the folders that contain sklearn and delete them.
6- Back in cmd: pip install sklearn

A column-vector y was passed when a 1d array was expected

Question: A column-vector y was passed when a 1d array was expected

I need to fit RandomForestRegressor from sklearn.ensemble.

forest = ensemble.RandomForestRegressor(**RF_tuned_parameters)
model = forest.fit(train_fold, train_y)
yhat = model.predict(test_fold)

This code always worked until I made some preprocessing of data (train_y). The error message says:

DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples,), for example using ravel().

model = forest.fit(train_fold, train_y)

Previously train_y was a Series; now it’s a numpy array (it is a column vector). If I apply train_y.ravel(), it becomes a row vector and no error message appears, though the prediction step takes a very long time (actually, it never finishes…).

In the docs of RandomForestRegressor I found that train_y should be defined as y : array-like, shape = [n_samples] or [n_samples, n_outputs]. Any idea how to solve this issue?


Answer 0

Change this line:

model = forest.fit(train_fold, train_y)

to:

model = forest.fit(train_fold, train_y.values.ravel())

Edit:

.values will give the values as an array of shape (n, 1).

.ravel will convert that array shape to (n,).
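A minimal sketch of what this does to the shapes, assuming train_y is a single-column pandas object like the one described in the question (the DataFrame here is made up for illustration):

import numpy as np
import pandas as pd

# A made-up single-column DataFrame standing in for train_y
train_y = pd.DataFrame({'target': [1.0, 2.0, 3.0]})

print(train_y.values.shape)          # (3, 1), a column vector
print(train_y.values.ravel().shape)  # (3,), the 1d shape sklearn expects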


Answer 1

I also encountered this situation when I was trying to train a KNN classifier, but it seems that the warning was gone after I changed

knn.fit(X_train, y_train)

to

knn.fit(X_train, np.ravel(y_train, order='C'))

Ahead of this line I used import numpy as np.


Answer 2

I had the same problem. The problem was that the labels were in a column format while it expected them in a row. Use np.ravel():

knn.score(training_set, np.ravel(training_labels))

Hope this solves it.


Answer 3

Use the code below:

model = forest.fit(train_fold, train_y.ravel())

If you are still getting slapped with an error identical to the one below:

Unknown label type: %r" % y

then use this code:

y = train_y.ravel()
train_y = np.array(y).astype(int)
model = forest.fit(train_fold, train_y)

Answer 4

Another way of doing this, instead of ravel, is to use reshape:

model = forest.fit(train_fold, train_y.values.reshape(-1,))

Answer 5

With neuraxle, you can easily solve this:

p = Pipeline([
   # expected outputs shape: (n, 1)
   OutputTransformerWrapper(NumpyRavel()), 
   # expected outputs shape: (n, )
   RandomForestRegressor(**RF_tuned_parameters)
])

p, outputs = p.fit_transform(data_inputs, expected_outputs)

Neuraxle is a sklearn-like framework for hyperparameter tuning and AutoML in deep learning projects!


Answer 6

# Flatten the column-vector train_y into a plain Python list of scalars:
format_train_y = []
for n in train_y:
    format_train_y.append(n[0])

Answer 7

Y = y.values[:, 0]

Here Y is the formatted train_y and y is the original train_y.


Finding local maxima/minima with Numpy in a 1D numpy array

Question: Finding local maxima/minima with Numpy in a 1D numpy array

Can you suggest a module function from numpy/scipy that can find local maxima/minima in a 1D numpy array? Obviously the simplest approach ever is to have a look at the nearest neighbours, but I would like to have an accepted solution that is part of the numpy distro.


Answer 0

If you are looking for all entries in the 1d array a that are smaller than their neighbors, you can try

numpy.r_[True, a[1:] < a[:-1]] & numpy.r_[a[:-1] < a[1:], True]

You could also smooth your array before this step using numpy.convolve().

I don’t think there is a dedicated function for this.
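A small worked example of the expression above, finding the local minima of a toy array:

import numpy as np

a = np.array([3, 2, 5, 1, 4, 4])

# True where an entry is smaller than both neighbors
# (each endpoint is compared against its single neighbor only)
minima = np.r_[True, a[1:] < a[:-1]] & np.r_[a[:-1] < a[1:], True]
print(minima)               # [False  True False  True False False]
print(np.where(minima)[0])  # [1 3], the indices of the local minima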


Answer 1

In SciPy >= 0.11

import numpy as np
from scipy.signal import argrelextrema

x = np.random.random(12)

# for local maxima
argrelextrema(x, np.greater)

# for local minima
argrelextrema(x, np.less)

Produces

>>> x
array([ 0.56660112,  0.76309473,  0.69597908,  0.38260156,  0.24346445,
    0.56021785,  0.24109326,  0.41884061,  0.35461957,  0.54398472,
    0.59572658,  0.92377974])
>>> argrelextrema(x, np.greater)
(array([1, 5, 7]),)
>>> argrelextrema(x, np.less)
(array([4, 6, 8]),)

Note, these are the indices of x that are local max/min. To get the values, try:

>>> x[argrelextrema(x, np.greater)[0]]

scipy.signal also provides argrelmax and argrelmin for finding maxima and minima respectively.


Answer 2

For curves with not too much noise, I recommend the following small code snippet:

from numpy import *

# example data with some peaks:
x = linspace(0, 4, 1000)  # note: the sample count must be an integer
data = .2*sin(10*x)+ exp(-abs(2-x)**2)

# these are the lines you need:
a = diff(sign(diff(data))).nonzero()[0] + 1 # local min+max
b = (diff(sign(diff(data))) > 0).nonzero()[0] + 1 # local min
c = (diff(sign(diff(data))) < 0).nonzero()[0] + 1 # local max


# graphical output...
from pylab import *
plot(x,data)
plot(x[b], data[b], "o", label="min")
plot(x[c], data[c], "o", label="max")
legend()
show()

The +1 is important, because diff reduces the original index number.


Answer 3

Another approach (more words, less code) that may help:

The locations of local maxima and minima are also the locations of the zero crossings of the first derivative. It is generally much easier to find zero crossings than it is to directly find local maxima and minima.

Unfortunately, the first derivative tends to “amplify” noise, so when significant noise is present in the original data, the first derivative is best used only after the original data has had some degree of smoothing applied.

Since smoothing is, in the simplest sense, a low pass filter, the smoothing is often best (well, most easily) done by using a convolution kernel, and “shaping” that kernel can provide a surprising amount of feature-preserving/enhancing capability. The process of finding an optimal kernel can be automated using a variety of means, but the best may be simple brute force (plenty fast for finding small kernels). A good kernel will (as intended) massively distort the original data, but it will NOT affect the location of the peaks/valleys of interest.

Fortunately, quite often a suitable kernel can be created via a simple SWAG (“educated guess”). The width of the smoothing kernel should be a little wider than the widest expected “interesting” peak in the original data, and its shape will resemble that peak (a single-scaled wavelet). For mean-preserving kernels (what any good smoothing filter should be) the sum of the kernel elements should be precisely equal to 1.00, and the kernel should be symmetric about its center (meaning it will have an odd number of elements).

Given an optimal smoothing kernel (or a small number of kernels optimized for different data content), the degree of smoothing becomes a scaling factor for (the “gain” of) the convolution kernel.

Determining the “correct” (optimal) degree of smoothing (convolution kernel gain) can even be automated: Compare the standard deviation of the first derivative data with the standard deviation of the smoothed data. How the ratio of the two standard deviations changes with changes in the degree of smoothing can be used to predict effective smoothing values. A few manual data runs (that are truly representative) should be all that’s needed.

All the prior solutions posted above compute the first derivative, but they don’t treat it as a statistical measure, nor do they attempt to perform feature preserving/enhancing smoothing (to help subtle peaks “leap above” the noise).

Finally, the bad news: Finding “real” peaks becomes a royal pain when the noise also has features that look like real peaks (overlapping bandwidth). The next more-complex solution is generally to use a longer convolution kernel (a “wider kernel aperture”) that takes into account the relationship between adjacent “real” peaks (such as minimum or maximum rates for peak occurrence), or to use multiple convolution passes using kernels having different widths (but only if it is faster: it is a fundamental mathematical truth that linear convolutions performed in sequence can always be convolved together into a single convolution). But it is often far easier to first find a sequence of useful kernels (of varying widths) and convolve them together than it is to directly find the final kernel in a single step.

Hopefully this provides enough info to let Google (and perhaps a good stats text) fill in the gaps. I really wish I had the time to provide a worked example, or a link to one. If anyone comes across one online, please post it here!
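In that spirit, here is a minimal, hedged sketch of the idea (all choices here, such as the kernel width, the noise level, and the prominence threshold, are illustrative assumptions rather than tuned values):

import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 4, 1000)
noisy = np.exp(-(x - 2)**2) + 0.05 * rng.standard_normal(x.size)

# Mean-preserving smoothing kernel: symmetric, odd length, sums to 1.00
kernel = np.exp(-np.linspace(-2, 2, 21)**2)
kernel /= kernel.sum()
smoothed = np.convolve(noisy, kernel, mode='same')

# Local maxima are downward zero crossings of the first derivative
d = np.diff(smoothed)
crossings = np.where((d[:-1] > 0) & (d[1:] <= 0))[0] + 1

# Keep only prominent maxima; noise can add small spurious crossings
peaks = crossings[smoothed[crossings] > 0.5]
print(x[peaks])  # near 2.0, the location of the true peak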


Answer 4

As of SciPy version 1.1, you can also use find_peaks. Below are two examples taken from the documentation itself.

Using the height argument, one can select all maxima above a certain threshold (in this example, all non-negative maxima; this can be very useful if one has to deal with a noisy baseline; if you want to find minima, just multiply your input by -1):

import matplotlib.pyplot as plt
from scipy.misc import electrocardiogram
from scipy.signal import find_peaks
import numpy as np

x = electrocardiogram()[2000:4000]
peaks, _ = find_peaks(x, height=0)
plt.plot(x)
plt.plot(peaks, x[peaks], "x")
plt.plot(np.zeros_like(x), "--", color="gray")
plt.show()

[figure: ECG excerpt with all non-negative maxima marked]

Another extremely helpful argument is distance, which defines the minimum distance between two peaks:

peaks, _ = find_peaks(x, distance=150)
# difference between peaks is >= 150
print(np.diff(peaks))
# prints [186 180 177 171 177 169 167 164 158 162 172]

plt.plot(x)
plt.plot(peaks, x[peaks], "x")
plt.show()

[figure: ECG excerpt with peaks at least 150 samples apart marked]


Answer 5

Why not use the Scipy built-in function signal.find_peaks_cwt to do the job?

from scipy import signal
import numpy as np

#generate junk data (numpy 1D arr)
xs = np.arange(0, np.pi, 0.05)
data = np.sin(xs)

# maxima : use builtin function to find (max) peaks
max_peakind = signal.find_peaks_cwt(data, np.arange(1,10))

# inverse  (in order to find minima)
inv_data = 1/data
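# caution: data[0] == sin(0) == 0 here, so 1/data divides by zero at the
# first sample; negating the data (finding maxima of -data) is a safer inversion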
# minima : use builtin function to find (min) peaks (use inverted data)
min_peakind = signal.find_peaks_cwt(inv_data, np.arange(1,10))

#show results
print "maxima",  data[max_peakind]
print "minima",  data[min_peakind]

results:

maxima [ 0.9995736]
minima [ 0.09146464]

Regards


Answer 6

Update: I wasn’t happy with gradient, so I found it more reliable to use numpy.diff. Please let me know if it does what you want.

Regarding the issue of noise: the mathematical problem is to locate maxima/minima; if we want to handle noise, we can use something like convolve, which was mentioned earlier.

import numpy as np
from matplotlib import pyplot

a = np.array([10.3,2,0.9,4,5,6,7,34,2,5,25,3,-26,-20,-29], dtype=np.float)

gradients = np.diff(a)
print gradients


maxima_num = 0
minima_num = 0
max_locations = []
min_locations = []
count = 0
for i in gradients[:-1]:
    count += 1

    if ((cmp(i, 0) > 0) & (cmp(gradients[count], 0) < 0) & (i != gradients[count])):
        maxima_num += 1
        max_locations.append(count)

    if ((cmp(i, 0) < 0) & (cmp(gradients[count], 0) > 0) & (i != gradients[count])):
        minima_num += 1
        min_locations.append(count)


turning_points = {'maxima_number': maxima_num, 'minima_number': minima_num,
                  'maxima_locations': max_locations, 'minima_locations': min_locations}

print turning_points

pyplot.plot(a)
pyplot.show()

Answer 7

While this question is really old, I believe there is a much simpler approach in numpy (a one-liner).

import numpy as np

list = [1,3,9,5,2,5,6,9,7]

np.diff(np.sign(np.diff(list))) #the one liner

#output
array([ 0, -2,  0,  2,  0,  0, -2])

To find a local max or min we essentially want to find when the difference between the values in the list (3-1, 9-3…) changes from positive to negative (max) or negative to positive (min). Therefore, first we find the difference. Then we find the sign, and then we find the changes in sign by taking the difference again. (Sort of like a first and second derivative in calculus, only we have discrete data and don’t have a continuous function.)

The output in my example does not contain the extrema (the first and last values in the list). Also, just like in calculus, if the second derivative is negative you have a max, and if it is positive you have a min.

Thus we have the following matchup:

[1,  3,  9,  5,  2,  5,  6,  9,  7]
    [0, -2,  0,  2,  0,  0, -2]
        Max     Min         Max
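To turn that one-liner into actual index positions, a short follow-up sketch (element i of the double-diff output corresponds to element i+1 of the original list, hence the +1):

import numpy as np

a = np.array([1, 3, 9, 5, 2, 5, 6, 9, 7])
second_diff = np.diff(np.sign(np.diff(a)))

local_max = np.where(second_diff < 0)[0] + 1  # [2 7], values 9 and 9
local_min = np.where(second_diff > 0)[0] + 1  # [4], value 2
print(local_max)
print(local_min)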

Answer 8

None of these solutions worked for me, since I wanted to find peaks in the center of repeating values as well. For example, in

ar = np.array([0,1,2,2,2,1,3,3,3,2,5,0])

the answer should be

array([ 3,  7, 10], dtype=int64)

I did this using a loop. I know it’s not super clean, but it gets the job done.

def findLocalMaxima(ar):
    # find local maxima of array, including centers of repeating elements
    maxInd = np.zeros_like(ar)
    peakVar = -np.inf
    i = -1
    while i < len(ar) - 1:
        i += 1
        if peakVar < ar[i]:
            peakVar = ar[i]
            for j in range(i, len(ar)):
                if peakVar < ar[j]:
                    break
                elif peakVar == ar[j]:
                    continue
                elif peakVar > ar[j]:
                    # center of the plateau between i and j
                    peakInd = i + np.floor(abs(i - j) / 2)
                    maxInd[peakInd.astype(int)] = 1
                    i = j
                    break
        peakVar = ar[i]
    maxInd = np.where(maxInd)[0]
    return maxInd

Answer 9

import numpy as np
x = np.array([6,3,5,2,1,4,9,7,8])
y = np.array([2,1,3,5,3,9,8,10,7])
sortId = np.argsort(x)
x = x[sortId]
y = y[sortId]
length = len(y)  # note: this definition was missing from the original answer
minm = np.array([])
maxm = np.array([])
i = 0
while i < length-1:
    if i < length - 1:
        while i < length-1 and y[i+1] >= y[i]:
            i += 1

        if i != 0 and i < length-1:
            maxm = np.append(maxm, i)

        i += 1

    if i < length - 1:
        while i < length-1 and y[i+1] <= y[i]:
            i += 1

        if i < length-1:
            minm = np.append(minm, i)
        i += 1


print minm
print maxm

minm and maxm contain the indices of minima and maxima, respectively. For a huge data set this will report lots of maxima/minima, so in that case smooth the curve first and then apply this algorithm.


Answer 10

Another solution using essentially a dilate operator:

import numpy as np
from scipy.ndimage import rank_filter

def find_local_maxima(x):
    x_dilate = rank_filter(x, -1, size=3)
    return x_dilate == x

and for the minima:

def find_local_minima(x):
    x_erode = rank_filter(x, 0, size=3)
    return x_erode == x

Also, from scipy.ndimage you can replace rank_filter(x, -1, size=3) with grey_dilation and rank_filter(x, 0, size=3) with grey_erosion. This won’t require a local sort, so it is slightly faster.
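The grey_dilation variant mentioned above might look like this (a sketch; the same 3-wide neighborhood is assumed):

import numpy as np
from scipy.ndimage import grey_dilation

def find_local_maxima_dilation(x):
    # a point that equals the maximum of its 3-neighborhood is a local max
    return grey_dilation(x, size=3) == x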


Answer 11

Another one:


def local_maxima_mask(vec):
    """
    Get a mask of all points in vec which are local maxima
    :param vec: A real-valued vector
    :return: A boolean mask of the same size where True elements correspond to maxima. 
    """
    mask = np.zeros(vec.shape, dtype=bool)
    greater_than_the_last = np.diff(vec) > 0  # length N-1
    mask[1:] = greater_than_the_last
    mask[:-1] &= ~greater_than_the_last
    return mask
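A quick usage check of the mask function above (the input values are arbitrary):

import numpy as np

vec = np.array([0.2, 1.0, 0.4, 0.9, 0.1])
mask = local_maxima_mask(vec)
print(np.where(mask)[0])  # [1 3], the positions of 1.0 and 0.9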

How to convert an array of strings to an array of floats in numpy?

Question: How to convert an array of strings to an array of floats in numpy?

How to convert

["1.1", "2.2", "3.2"]

to

[1.1, 2.2, 3.2]

in NumPy?


Answer 0

Well, if you’re reading the data in as a list, just do np.array(map(float, list_of_strings)) (or equivalently, use a list comprehension). (In Python 3, you’ll need to call list on the map return value if you use map, since map returns an iterator now.)

However, if it’s already a numpy array of strings, there’s a better way. Use astype().

import numpy as np
x = np.array(['1.1', '2.2', '3.3'])
y = x.astype(np.float)
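For reference, a small sketch of the Python 3 variant mentioned above:

import numpy as np

list_of_strings = ["1.1", "2.2", "3.2"]

# On Python 3, map() returns an iterator, so wrap it in list()
# (or use a list comprehension) before handing it to np.array
arr = np.array(list(map(float, list_of_strings)))
print(arr)        # [1.1 2.2 3.2]
print(arr.dtype)  # float64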

Answer 1

You can use this as well

import numpy as np
x=np.array(['1.1', '2.2', '3.3'])
x=np.asfarray(x,float)

Answer 2

Another option might be numpy.asarray:

import numpy as np
a = ["1.1", "2.2", "3.2"]
b = np.asarray(a, dtype=np.float64, order='C')

For Python 2:

print a, type(a), type(a[0])
print b, type(b), type(b[0])

resulting in:

['1.1', '2.2', '3.2'] <type 'list'> <type 'str'>
[1.1 2.2 3.2] <type 'numpy.ndarray'> <type 'numpy.float64'>

Answer 3

If you have (or create) a single string, you can use np.fromstring:

import numpy as np
x = ["1.1", "2.2", "3.2"]
x = ','.join(x)
x = np.fromstring( x, dtype=np.float, sep=',' )

Note, x = ','.join(x) transforms the x list into the string '1.1,2.2,3.2'. If you read a line from a txt file, each line will already be a string.


Unable to allocate array with shape and data type

Question: Unable to allocate array with shape and data type

I’m facing an issue with allocating huge arrays in numpy on Ubuntu 18 while not facing the same issue on MacOS.

I am trying to allocate memory for a numpy array with shape (156816, 36, 53806) with

np.zeros((156816, 36, 53806), dtype='uint8')

and I get the following error on Ubuntu:

>>> import numpy as np
>>> np.zeros((156816, 36, 53806), dtype='uint8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
numpy.core._exceptions.MemoryError: Unable to allocate array with shape (156816, 36, 53806) and data type uint8

I’m not getting it on MacOS:

>>> import numpy as np 
>>> np.zeros((156816, 36, 53806), dtype='uint8')
array([[[0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        ...,
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0]],

       [[0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        ...,
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0]],

       [[0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        ...,
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0]],

       ...,

       [[0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        ...,
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0]],

       [[0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        ...,
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0]],

       [[0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        ...,
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0]]], dtype=uint8)

I’ve read somewhere that np.zeros shouldn’t really allocate the whole memory needed for the array, but only for the non-zero elements. And the Ubuntu machine has 64 GB of memory, while my MacBook Pro has only 16 GB.

versions:

Ubuntu
os -> ubuntu mate 18
python -> 3.6.8
numpy -> 1.17.0

mac
os -> 10.14.6
python -> 3.6.4
numpy -> 1.17.0

PS: also failed on Google Colab


Answer 0

This is likely due to your system’s overcommit handling mode.

In the default mode, 0,

Heuristic overcommit handling. Obvious overcommits of address space are refused. Used for a typical system. It ensures a seriously wild allocation fails while allowing overcommit to reduce swap usage. root is allowed to allocate slightly more memory in this mode. This is the default.

The exact heuristic used is not well explained here, but this is discussed more on Linux over commit heuristic and on this page.

You can check your current overcommit mode by running

$ cat /proc/sys/vm/overcommit_memory
0

In this case you’re allocating

>>> 156816 * 36 * 53806 / 1024.0**3
282.8939827680588

~282 GB, and the kernel is saying well obviously there’s no way I’m going to be able to commit that many physical pages to this, and it refuses the allocation.

If (as root) you run:

$ echo 1 > /proc/sys/vm/overcommit_memory

This will enable “always overcommit” mode, and you’ll find that indeed the system will allow you to make the allocation no matter how large it is (within 64-bit memory addressing at least).

I tested this myself on a machine with 32 GB of RAM. With overcommit mode 0 I also got a MemoryError, but after changing it back to 1 it works:

>>> import numpy as np
>>> a = np.zeros((156816, 36, 53806), dtype='uint8')
>>> a.nbytes
303755101056

You can then go ahead and write to any location within the array, and the system will only allocate physical pages when you explicitly write to that page. So you can use this, with care, for sparse arrays.
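Note that echoing into /proc only changes the setting until the next reboot. To make it persistent, the usual approach (assuming a standard sysctl setup) is something like:

$ sudo sysctl vm.overcommit_memory=1                               # apply now
$ echo 'vm.overcommit_memory = 1' | sudo tee -a /etc/sysctl.conf   # persist after reboot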


Answer 1

I had this same problem on Windows and came across this solution. So if someone comes across this problem on Windows: the solution for me was to increase the pagefile size, as it was a memory overcommitment problem for me too.

Windows 8

  1. On the Keyboard Press the WindowsKey + X then click System in the popup menu
  2. Tap or click Advanced system settings. You might be asked for an admin password or to confirm your choice
  3. On the Advanced tab, under Performance, tap or click Settings.
  4. Tap or click the Advanced tab, and then, under Virtual memory, tap or click Change
  5. Clear the Automatically manage paging file size for all drives check box.
  6. Under Drive [Volume Label], tap or click the drive that contains the paging file you want to change
  7. Tap or click Custom size, enter a new size in megabytes in the initial size (MB) or Maximum size (MB) box, tap or click Set, and then tap or click OK
  8. Reboot your system

Windows 10

  1. Press the Windows key
  2. Type SystemPropertiesAdvanced
  3. Click Run as administrator
  4. Under Performance, click Settings
  5. Select the Advanced tab
  6. Select Change…
  7. Uncheck Automatically managing paging file size for all drives
  8. Then select Custom size and fill in the appropriate size
  9. Press Set then press OK then exit from the Virtual Memory, Performance Options, and System Properties Dialog
  10. Reboot your system

Note: I did not have enough memory on my system for the ~282 GB in this example, but for my particular case this worked.

EDIT

From here the suggested recommendations for page file size:

There is a formula for calculating the correct pagefile size. Initial size is one and a half (1.5) x the amount of total system memory. Maximum size is three (3) x the initial size. So let’s say you have 4 GB (1 GB = 1,024 MB x 4 = 4,096 MB) of memory. The initial size would be 1.5 x 4,096 = 6,144 MB and the maximum size would be 3 x 6,144 = 18,432 MB.

Some things to keep in mind from here:

However, this does not take into consideration other important factors and system settings that may be unique to your computer. Again, let Windows choose what to use instead of relying on some arbitrary formula that worked on a different computer.

Also:

Increasing page file size may help prevent instabilities and crashing in Windows. However, hard drive read/write times are much slower than they would be if the data were in your computer’s memory. Having a larger page file will add extra work for your hard drive, causing everything else to run slower. Page file size should only be increased when encountering out-of-memory errors, and only as a temporary fix. A better solution is to add more memory to the computer.


Answer 2

I came across this problem on Windows too. The solution for me was to switch from a 32-bit to a 64-bit version of Python. Indeed, 32-bit software, like a 32-bit CPU, can address a maximum of 4 GB of RAM (2^32). So if you have more than 4 GB of RAM, a 32-bit version cannot take advantage of it.

With a 64-bit version of Python (the one labeled x86-64 in the download page), the issue disappeared.

You can check which version you have by entering the interpreter. I, with a 64-bit version, now have: Python 3.7.5rc1 (tags/v3.7.5rc1:4082f600a5, Oct 1 2019, 20:28:14) [MSC v.1916 64 bit (AMD64)], where [MSC v.1916 64 bit (AMD64)] means “64-bit Python”.
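A quick way to check the bitness from code (this works on both Python 2 and 3):

import struct
import sys

# 8 bytes per pointer means a 64-bit interpreter, 4 bytes means 32-bit
print(struct.calcsize('P') * 8)
print(sys.maxsize > 2**32)  # True on a 64-bit build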

Note: as of the time of this writing (May 2020), matplotlib is not available for Python 3.9, so I recommend installing Python 3.7, 64-bit.



Answer 3

In my case, adding a dtype attribute changed the dtype of the array to a smaller type (from float64 to uint8), decreasing the array size enough to not throw a MemoryError on Windows (64-bit).

from

mask = np.zeros(edges.shape)

to

mask = np.zeros(edges.shape,dtype='uint8')

Answer 4

Sometimes this error pops up because the kernel has reached its limit. Try restarting the kernel and redoing the necessary steps.


Answer 5

Changing the data type to one that uses less memory works. For me, I changed the data type to numpy.uint8:

data['label'] = data['label'].astype(np.uint8)
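A small sketch of how much this can save (the array size here, 10 million elements, is arbitrary):

import numpy as np

a = np.zeros(10000000)       # float64 by default, 8 bytes per element
b = a.astype(np.uint8)       # 1 byte per element

print(a.nbytes / 1024.0**2)  # ~76.3 MB
print(b.nbytes / 1024.0**2)  # ~9.5 MB, 8x smaller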

How to write a multidimensional array to a text file?

Question: How to write a multidimensional array to a text file?

In another question, other users offered some help if I could supply the array I was having trouble with. However, I even fail at a basic I/O task, such as writing an array to a file.

Can anyone explain what kind of loop I would need to write a 4x11x14 numpy array to file?

This array consists of four 11 x 14 arrays, so I should format it with nice newlines to make the file easier for others to read.

Edit: So I’ve tried the numpy.savetxt function. Strangely, it gives the following error:

TypeError: float argument required, not numpy.ndarray

I assume that this is because the function doesn’t work with multidimensional arrays? Any solutions as I would like them within one file?


Answer 0

If you want to write it to disk so that it will be easy to read back in as a numpy array, look into numpy.save. Pickling it will work fine, as well, but it’s less efficient for large arrays (which yours isn’t, so either is perfectly fine).

If you want it to be human readable, look into numpy.savetxt.
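For completeness, a minimal np.save / np.load round trip (binary .npy, not human-readable) might look like this:

import numpy as np

data = np.arange(200).reshape((4, 5, 10))

# The .npy file stores shape and dtype, so no reshape is needed on load
np.save('data.npy', data)
restored = np.load('data.npy')
assert np.array_equal(restored, data)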

Edit: So, it seems like savetxt isn’t quite as great an option for arrays with more than 2 dimensions… But just to draw everything out to its full conclusion:

I just realized that numpy.savetxt chokes on ndarrays with more than 2 dimensions… This is probably by design, as there’s no inherently defined way to indicate additional dimensions in a text file.

E.g. This (a 2D array) works fine

import numpy as np
x = np.arange(20).reshape((4,5))
np.savetxt('test.txt', x)

While the same thing would fail (with a rather uninformative error: TypeError: float argument required, not numpy.ndarray) for a 3D array:

import numpy as np
x = np.arange(200).reshape((4,5,10))
np.savetxt('test.txt', x)

One workaround is just to break the 3D (or greater) array into 2D slices. E.g.

x = np.arange(200).reshape((4,5,10))
with open('test.txt', 'w') as outfile:
    for slice_2d in x:
        np.savetxt(outfile, slice_2d)

However, our goal is to be clearly human readable, while still being easily read back in with numpy.loadtxt. Therefore, we can be a bit more verbose, and differentiate the slices using commented out lines. By default, numpy.loadtxt will ignore any lines that start with # (or whichever character is specified by the comments kwarg). (This looks more verbose than it actually is…)

import numpy as np

# Generate some test data
data = np.arange(200).reshape((4,5,10))

# Write the array to disk
with open('test.txt', 'w') as outfile:
    # I'm writing a header here just for the sake of readability
    # Any line starting with "#" will be ignored by numpy.loadtxt
    outfile.write('# Array shape: {0}\n'.format(data.shape))
    
    # Iterating through a ndimensional array produces slices along
    # the last axis. This is equivalent to data[i,:,:] in this case
    for data_slice in data:

        # The formatting string indicates that I'm writing out
        # the values in left-justified columns 7 characters in width
        # with 2 decimal places.  
        np.savetxt(outfile, data_slice, fmt='%-7.2f')

        # Writing out a break to indicate different slices...
        outfile.write('# New slice\n')

This yields:

# Array shape: (4, 5, 10)
0.00    1.00    2.00    3.00    4.00    5.00    6.00    7.00    8.00    9.00   
10.00   11.00   12.00   13.00   14.00   15.00   16.00   17.00   18.00   19.00  
20.00   21.00   22.00   23.00   24.00   25.00   26.00   27.00   28.00   29.00  
30.00   31.00   32.00   33.00   34.00   35.00   36.00   37.00   38.00   39.00  
40.00   41.00   42.00   43.00   44.00   45.00   46.00   47.00   48.00   49.00  
# New slice
50.00   51.00   52.00   53.00   54.00   55.00   56.00   57.00   58.00   59.00  
60.00   61.00   62.00   63.00   64.00   65.00   66.00   67.00   68.00   69.00  
70.00   71.00   72.00   73.00   74.00   75.00   76.00   77.00   78.00   79.00  
80.00   81.00   82.00   83.00   84.00   85.00   86.00   87.00   88.00   89.00  
90.00   91.00   92.00   93.00   94.00   95.00   96.00   97.00   98.00   99.00  
# New slice
100.00  101.00  102.00  103.00  104.00  105.00  106.00  107.00  108.00  109.00 
110.00  111.00  112.00  113.00  114.00  115.00  116.00  117.00  118.00  119.00 
120.00  121.00  122.00  123.00  124.00  125.00  126.00  127.00  128.00  129.00 
130.00  131.00  132.00  133.00  134.00  135.00  136.00  137.00  138.00  139.00 
140.00  141.00  142.00  143.00  144.00  145.00  146.00  147.00  148.00  149.00 
# New slice
150.00  151.00  152.00  153.00  154.00  155.00  156.00  157.00  158.00  159.00 
160.00  161.00  162.00  163.00  164.00  165.00  166.00  167.00  168.00  169.00 
170.00  171.00  172.00  173.00  174.00  175.00  176.00  177.00  178.00  179.00 
180.00  181.00  182.00  183.00  184.00  185.00  186.00  187.00  188.00  189.00 
190.00  191.00  192.00  193.00  194.00  195.00  196.00  197.00  198.00  199.00 
# New slice

Reading it back in is very easy, as long as we know the shape of the original array. We can just do numpy.loadtxt('test.txt').reshape((4,5,10)). As an example (you can do this in one line; I’m just being verbose to clarify things):

# Read the array from disk
new_data = np.loadtxt('test.txt')

# Note that this returned a 2D array!
print new_data.shape

# However, going back to 3D is easy if we know the 
# original shape of the array
new_data = new_data.reshape((4,5,10))
    
# Just to check that they're the same...
assert np.all(new_data == data)
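
As a refinement, the header that the writing snippet above puts in the first line already records the shape, so you can recover it at read time instead of hard-coding it. A minimal sketch, assuming the '# Array shape: (4, 5, 10)' header format used above:

import ast
import numpy as np

# Parse the shape tuple back out of the first header line
with open('test.txt') as infile:
    header = infile.readline()           # "# Array shape: (4, 5, 10)"
shape = ast.literal_eval(header.split(':', 1)[1].strip())

# loadtxt still skips all the '#' comment lines; reshape with the recovered shape
new_data = np.loadtxt('test.txt').reshape(shape)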

回答 1

鉴于我认为您希望文件对人可读,我不确定这是否满足您的要求;但如果这不是主要诉求,直接用 pickle 就可以了。

要保存它:

import pickle

my_data = {'a': [1, 2.0, 3, 4+6j],
           'b': ('string', u'Unicode string'),
           'c': None}
output = open('data.pkl', 'wb')
pickle.dump(my_data, output)
output.close()

读回:

import pprint, pickle

pkl_file = open('data.pkl', 'rb')

data1 = pickle.load(pkl_file)
pprint.pprint(data1)

pkl_file.close()

I am not certain if this meets your requirements, given I think you are interested in making the file readable by people, but if that’s not a primary concern, just pickle it.

To save it:

import pickle

my_data = {'a': [1, 2.0, 3, 4+6j],
           'b': ('string', u'Unicode string'),
           'c': None}
output = open('data.pkl', 'wb')
pickle.dump(my_data, output)
output.close()

To read it back:

import pprint, pickle

pkl_file = open('data.pkl', 'rb')

data1 = pickle.load(pkl_file)
pprint.pprint(data1)

pkl_file.close()
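
Since the question is about numpy arrays, note the same pattern applies to an ndarray directly; a minimal sketch using with blocks so the files are closed automatically:

import pickle

import numpy as np

arr = np.arange(200).reshape((4, 5, 10))

with open('arr.pkl', 'wb') as f:
    pickle.dump(arr, f)

with open('arr.pkl', 'rb') as f:
    arr2 = pickle.load(f)

# shape and dtype survive the round trip
assert np.array_equal(arr, arr2)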

回答 2

如果不需要人类可读的输出,另一种可以尝试的选择是将数组另存为 MATLAB 的 .mat 文件,它是一种结构化数组。我鄙视 MATLAB,但我能用很少几行代码读写 .mat 文件这一点确实很方便。

与 Joe Kington 的回答不同,这样做的好处是您不需要知道 .mat 文件中数据的原始形状,即读取时无需 reshape。而且,与使用 pickle 不同,.mat 文件可以被 MATLAB 读取,也可能被其他一些程序/语言读取。

这是一个例子:

import numpy as np
import scipy.io

# Some test data
x = np.arange(200).reshape((4,5,10))

# Specify the filename of the .mat file
matfile = 'test_mat.mat'

# Write the array to the mat file. For this to work, the array must be the value
# corresponding to a key name of your choice in a dictionary
scipy.io.savemat(matfile, mdict={'out': x}, oned_as='row')

# For the above line, I specified the kwarg oned_as since python (2.7 with 
# numpy 1.6.1) throws a FutureWarning.  Here, this isn't really necessary 
# since oned_as is a kwarg for dealing with 1-D arrays.

# Now load in the data from the .mat that was just saved
matdata = scipy.io.loadmat(matfile)

# And just to check if the data is the same:
assert np.all(x == matdata['out'])

如果忘记了数组在 .mat 文件中对应的键名,您随时可以执行以下操作:

print matdata.keys()

当然,您可以使用更多键存储许多数组。

因此,是的——它无法用肉眼直接阅读,但写入和读取数据各只需两行代码,我认为这是一个公平的权衡。

看看scipy.io.savematscipy.io.loadmat的文档, 以及本教程页面:scipy.io文件IO教程

If you don’t need a human-readable output, another option you could try is to save the array as a MATLAB .mat file, which is a structured array. I despise MATLAB, but the fact that I can both read and write a .mat in very few lines is convenient.

Unlike Joe Kington’s answer, the benefit of this is that you don’t need to know the original shape of the data in the .mat file, i.e. no need to reshape upon reading in. And, unlike using pickle, a .mat file can be read by MATLAB, and probably some other programs/languages as well.

Here is an example:

import numpy as np
import scipy.io

# Some test data
x = np.arange(200).reshape((4,5,10))

# Specify the filename of the .mat file
matfile = 'test_mat.mat'

# Write the array to the mat file. For this to work, the array must be the value
# corresponding to a key name of your choice in a dictionary
scipy.io.savemat(matfile, mdict={'out': x}, oned_as='row')

# For the above line, I specified the kwarg oned_as since python (2.7 with 
# numpy 1.6.1) throws a FutureWarning.  Here, this isn't really necessary 
# since oned_as is a kwarg for dealing with 1-D arrays.

# Now load in the data from the .mat that was just saved
matdata = scipy.io.loadmat(matfile)

# And just to check if the data is the same:
assert np.all(x == matdata['out'])

If you forget the key that the array is named in the .mat file, you can always do:

print matdata.keys()

And of course you can store many arrays using many more keys.

So yes – it won’t be readable with your eyes, but only takes 2 lines to write and read the data, which I think is a fair trade-off.

Take a look at the docs for scipy.io.savemat and scipy.io.loadmat and also this tutorial page: scipy.io File IO Tutorial


回答 3

ndarray.tofile() 应该也可以

例如,如果您的数组名为 a:

a.tofile('yourfile.txt',sep=" ",format="%s")

虽然不确定如何获取换行格式。

编辑在此处向Kevin J. Black发表评论):

从1.5.0版开始,np.tofile()采用可选参数 newline='\n'以允许多行输出。 https://docs.scipy.org/doc/numpy-1.13.0/reference/generated/numpy.savetxt.html

ndarray.tofile() should also work

e.g. if your array is called a:

a.tofile('yourfile.txt',sep=" ",format="%s")

Not sure how to get newline formatting though.

Edit (credit to Kevin J. Black’s comment here):

Since version 1.5.0, np.savetxt takes an optional parameter newline='\n' to allow multi-line output. https://docs.scipy.org/doc/numpy-1.13.0/reference/generated/numpy.savetxt.html
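
One caveat worth spelling out: tofile() stores neither the shape nor the dtype, so you must supply both yourself when reading the data back with np.fromfile. A minimal round-trip sketch using the same sep=" " text mode as above:

import numpy as np

a = np.arange(200).reshape((4, 5, 10))
a.tofile('yourfile.txt', sep=" ", format="%s")

# fromfile returns a flat 1-D array; dtype and shape must be re-supplied
b = np.fromfile('yourfile.txt', dtype=a.dtype, sep=" ").reshape(a.shape)
assert np.array_equal(a, b)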


回答 4

有专门的库可以做到这一点。(加上python包装器)

希望这可以帮助

There exist special libraries to do just that. (Plus wrappers for python)

hope this helps


回答 5

您可以简单地用三重嵌套循环遍历数组,并把值写入文件。读取时,使用完全相同的循环结构即可:您会按正确的顺序拿到这些值,从而再次正确地填充数组。

You can simply traverse the array in three nested loops and write their values to your file. For reading, you simply use the same exact loop construction. You will get the values in exactly the right order to fill your arrays correctly again.
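
A minimal sketch of that idea for a 3D array (the shape is assumed to be known at read time, matching the write loops):

import numpy as np

data = np.arange(24).reshape((2, 3, 4))

# write: one value per line; the loop order fixes the layout
with open('loops.txt', 'w') as f:
    for i in range(data.shape[0]):
        for j in range(data.shape[1]):
            for k in range(data.shape[2]):
                f.write('%d\n' % data[i, j, k])

# read: the exact same loop structure refills the array in the same order
restored = np.empty((2, 3, 4), dtype=int)
with open('loops.txt') as f:
    for i in range(restored.shape[0]):
        for j in range(restored.shape[1]):
            for k in range(restored.shape[2]):
                restored[i, j, k] = int(f.readline())

assert np.array_equal(data, restored)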


回答 6

我有一种方法可以使用简单的filename.write()操作。它对我来说很好用,但是我正在处理具有约1500个数据元素的数组。

我基本上只需要for循环来遍历文件,然后以csv样式输出将其逐行写入输出目标。

import numpy as np

trial = np.genfromtxt("/extension/file.txt", dtype=str, delimiter=",")
num_of_columns = trial.shape[1]

with open("/extension/file.txt", "w") as f:
    for x in xrange(len(trial[:,1])):
        for y in range(num_of_columns):
            if y < num_of_columns - 1:
                f.write(trial[x][y] + ",")
            else:
                f.write(trial[x][y])
        f.write("\n")

if/else 语句用于在数据元素之间添加逗号(最后一列之后不加逗号)。不知道为什么,当把文件读成 ndarray 时这些逗号会被去掉。我的目标是把文件输出为 csv,这个方法有助于做到这一点。

希望这可以帮助!

I have a way to do it using a simple filename.write() operation. It works fine for me, but I’m dealing with arrays having ~1500 data elements.

I basically just have for loops to iterate through the file and write it to the output destination line-by-line in a csv style output.

import numpy as np

trial = np.genfromtxt("/extension/file.txt", dtype=str, delimiter=",")
num_of_columns = trial.shape[1]

with open("/extension/file.txt", "w") as f:
    for x in xrange(len(trial[:,1])):
        for y in range(num_of_columns):
            if y < num_of_columns - 1:
                f.write(trial[x][y] + ",")
            else:
                f.write(trial[x][y])
        f.write("\n")

The if/else statement is used to add commas between the data elements, with no trailing comma after the last column. For whatever reason, these get stripped out when reading the file in as an nd array. My goal was to output the file as a csv, so this method helps to handle that.

Hope this helps!


回答 7

pickle 最适合这类情况。假设您有一个名为 x_train 的 ndarray。您可以将其转储到文件中,然后用以下命令将其还原:

import pickle

### Dump to file
with open("myfile.pkl","wb") as f:
    pickle.dump(x_train,f)

###Extract from file
with open("myfile.pkl","rb") as f:
    x_temp = pickle.load(f)

Pickle is best for these cases. Suppose you have a ndarray named x_train. You can dump it into a file and revert it back using the following command:

import pickle

### Dump to file
with open("myfile.pkl","wb") as f:
    pickle.dump(x_train,f)

###Extract from file
with open("myfile.pkl","rb") as f:
    x_temp = pickle.load(f)

如何修复imdb.load_data()函数的“allow_pickle=False时无法加载对象数组”错误?

问题:如何修复imdb.load_data()函数的“allow_pickle=False时无法加载对象数组”错误?

我正在尝试使用Google Colab中的IMDb数据集实现二进制分类示例。我以前已经实现了此模型。但是,几天后我再次尝试执行此操作时,它返回一个值错误:对于load_data()函数,当allow_pickle = False时无法加载对象数组。

我已经参考一个类似问题的现有答案尝试解决此问题:如何修复sketch_rnn算法中的“当allow_pickle=False时无法加载对象数组”。但事实证明,仅仅添加 allow_pickle 参数是不够的。

我的代码:

from keras.datasets import imdb
(train_data, train_labels), (test_data, test_labels) = imdb.load_data(num_words=10000)

错误:

ValueError                                Traceback (most recent call last)
<ipython-input-1-2ab3902db485> in <module>()
      1 from keras.datasets import imdb
----> 2 (train_data, train_labels), (test_data, test_labels) = imdb.load_data(num_words=10000)

2 frames
/usr/local/lib/python3.6/dist-packages/keras/datasets/imdb.py in load_data(path, num_words, skip_top, maxlen, seed, start_char, oov_char, index_from, **kwargs)
     57                     file_hash='599dadb1135973df5b59232a0e9a887c')
     58     with np.load(path) as f:
---> 59         x_train, labels_train = f['x_train'], f['y_train']
     60         x_test, labels_test = f['x_test'], f['y_test']
     61 

/usr/local/lib/python3.6/dist-packages/numpy/lib/npyio.py in __getitem__(self, key)
    260                 return format.read_array(bytes,
    261                                          allow_pickle=self.allow_pickle,
--> 262                                          pickle_kwargs=self.pickle_kwargs)
    263             else:
    264                 return self.zip.read(key)

/usr/local/lib/python3.6/dist-packages/numpy/lib/format.py in read_array(fp, allow_pickle, pickle_kwargs)
    690         # The array contained Python objects. We need to unpickle the data.
    691         if not allow_pickle:
--> 692             raise ValueError("Object arrays cannot be loaded when "
    693                              "allow_pickle=False")
    694         if pickle_kwargs is None:

ValueError: Object arrays cannot be loaded when allow_pickle=False

I’m trying to implement the binary classification example using the IMDb dataset in Google Colab. I have implemented this model before. But when I tried to do it again after a few days, it returned a value error: 'Object arrays cannot be loaded when allow_pickle=False' for the load_data() function.

I have already tried solving this, referring to an existing answer for a similar problem: How to fix ‘Object arrays cannot be loaded when allow_pickle=False’ in the sketch_rnn algorithm. But it turns out that just adding an allow_pickle argument isn’t sufficient.

My code:

from keras.datasets import imdb
(train_data, train_labels), (test_data, test_labels) = imdb.load_data(num_words=10000)

The error:

ValueError                                Traceback (most recent call last)
<ipython-input-1-2ab3902db485> in <module>()
      1 from keras.datasets import imdb
----> 2 (train_data, train_labels), (test_data, test_labels) = imdb.load_data(num_words=10000)

2 frames
/usr/local/lib/python3.6/dist-packages/keras/datasets/imdb.py in load_data(path, num_words, skip_top, maxlen, seed, start_char, oov_char, index_from, **kwargs)
     57                     file_hash='599dadb1135973df5b59232a0e9a887c')
     58     with np.load(path) as f:
---> 59         x_train, labels_train = f['x_train'], f['y_train']
     60         x_test, labels_test = f['x_test'], f['y_test']
     61 

/usr/local/lib/python3.6/dist-packages/numpy/lib/npyio.py in __getitem__(self, key)
    260                 return format.read_array(bytes,
    261                                          allow_pickle=self.allow_pickle,
--> 262                                          pickle_kwargs=self.pickle_kwargs)
    263             else:
    264                 return self.zip.read(key)

/usr/local/lib/python3.6/dist-packages/numpy/lib/format.py in read_array(fp, allow_pickle, pickle_kwargs)
    690         # The array contained Python objects. We need to unpickle the data.
    691         if not allow_pickle:
--> 692             raise ValueError("Object arrays cannot be loaded when "
    693                              "allow_pickle=False")
    694         if pickle_kwargs is None:

ValueError: Object arrays cannot be loaded when allow_pickle=False

回答 0

这里有一个技巧,可以强制 imdb.load_data 允许 pickle:在您的笔记本中,将这一行:

(train_data, train_labels), (test_data, test_labels) = imdb.load_data(num_words=10000)

替换为:

import numpy as np
# save np.load
np_load_old = np.load

# modify the default parameters of np.load
np.load = lambda *a,**k: np_load_old(*a, allow_pickle=True, **k)

# call load_data with allow_pickle implicitly set to true
(train_data, train_labels), (test_data, test_labels) = imdb.load_data(num_words=10000)

# restore np.load for future normal usage
np.load = np_load_old

Here’s a trick to force imdb.load_data to allow pickle by, in your notebook, replacing this line:

(train_data, train_labels), (test_data, test_labels) = imdb.load_data(num_words=10000)

by this:

import numpy as np
# save np.load
np_load_old = np.load

# modify the default parameters of np.load
np.load = lambda *a,**k: np_load_old(*a, allow_pickle=True, **k)

# call load_data with allow_pickle implicitly set to true
(train_data, train_labels), (test_data, test_labels) = imdb.load_data(num_words=10000)

# restore np.load for future normal usage
np.load = np_load_old
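
If you need this trick in more than one place, a small context manager keeps the patch scoped, so np.load is guaranteed to be restored even if load_data raises; a sketch of the same idea:

import contextlib

import numpy as np

@contextlib.contextmanager
def np_load_allow_pickle():
    np_load_old = np.load
    np.load = lambda *a, **k: np_load_old(*a, allow_pickle=True, **k)
    try:
        yield
    finally:
        np.load = np_load_old   # restored even on error

with np_load_allow_pickle():
    from keras.datasets import imdb
    (train_data, train_labels), (test_data, test_labels) = imdb.load_data(num_words=10000)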

回答 1

这个问题在 keras 的 GitHub 上仍未关闭,我希望它能尽快解决。在此之前,请尝试将 numpy 降级到 1.16.1 版,这似乎可以解决问题。

!pip install numpy==1.16.1
import numpy as np

此版本的 numpy 中,allow_pickle 的默认值为 True。

This issue is still open on the keras git. I hope it gets solved as soon as possible. Until then, try downgrading your numpy version to 1.16.1. It seems to solve the problem.

!pip install numpy==1.16.1
import numpy as np

This version of numpy has the default value of allow_pickle as True.


回答 2

根据 GitHub 上的这个 issue,官方解决方案是编辑 imdb.py 文件。此修复对我来说效果很好,无需降级 numpy。在以下位置找到 imdb.py 文件:tensorflow/python/keras/datasets/imdb.py(对我来说完整路径是:C:\Anaconda\Lib\site-packages\tensorflow\python\keras\datasets\imdb.py——其他安装会有所不同),并按照下面的差异更改第 85 行:

-  with np.load(path) as f:
+  with np.load(path, allow_pickle=True) as f:

进行此更改的原因是安全性:防止 pickle 文件中出现相当于 SQL 注入的 Python 代码。上面的更改只会影响 imdb 数据,因此您在其他地方仍保留了这层安全性(也无需降级 numpy)。

Following this issue on GitHub, the official solution is to edit the imdb.py file. This fix worked well for me without the need to downgrade numpy. Find the imdb.py file at tensorflow/python/keras/datasets/imdb.py (full path for me was: C:\Anaconda\Lib\site-packages\tensorflow\python\keras\datasets\imdb.py – other installs will be different) and change line 85 as per the diff:

-  with np.load(path) as f:
+  with np.load(path, allow_pickle=True) as f:

The reason for the change is security, to prevent the Python equivalent of an SQL injection in a pickled file. The change above will ONLY affect the imdb data and you therefore retain the security elsewhere (by not downgrading numpy).


回答 3

我只是使用allow_pickle = True作为np.load()的参数,它对我有用。

I just used allow_pickle = True as an argument to np.load() and it worked for me.


回答 4

就我而言,下面的写法有效:

np.load(path, allow_pickle=True)

In my case, it worked with:

np.load(path, allow_pickle=True)

回答 5

我认为 cheez(https://stackoverflow.com/users/122933/cheez)的答案是最简单、最有效的。我会在其基础上稍作展开,使其不会在整个会话期间一直改写 numpy 的函数。

我的建议如下。我用它从 keras 下载 reuters 数据集,该数据集报的是同一类错误:

old = np.load
np.load = lambda *a,**k: old(*a,**k,allow_pickle=True)

from keras.datasets import reuters
(train_data, train_labels), (test_data, test_labels) = reuters.load_data(num_words=10000)

np.load = old
del(old)

I think the answer from cheez (https://stackoverflow.com/users/122933/cheez) is the easiest and most effective one. I’d elaborate a little bit over it so it would not modify a numpy function for the whole session period.

My suggestion is below. I´m using it to download the reuters dataset from keras which is showing the same kind of error:

old = np.load
np.load = lambda *a,**k: old(*a,**k,allow_pickle=True)

from keras.datasets import reuters
(train_data, train_labels), (test_data, test_labels) = reuters.load_data(num_words=10000)

np.load = old
del(old)

回答 6

您可以尝试更改标志的值

np.load(training_image_names_array,allow_pickle=True)

You can try changing the flag’s value

np.load(training_image_names_array,allow_pickle=True)

回答 7

上面列出的解决方案对我都不起作用:我在 anaconda 中运行 python 3.7.3。对我有用的是

  • 从 Anaconda powershell 运行 “conda install numpy==1.16.1”

  • 关闭并重新打开笔记本

none of the above listed solutions worked for me: i run anaconda with python 3.7.3. What worked for me was

  • run “conda install numpy==1.16.1” from Anaconda powershell

  • close and reopen the notebook


回答 8

在jupyter笔记本上使用

np_load_old = np.load

# modify the default parameters of np.load
np.load = lambda *a,**k: np_load_old(*a, allow_pickle=True, **k)

工作正常,但在 spyder 中使用此方法时就会出现问题(您每次都必须重新启动内核),否则会收到类似以下的错误:

TypeError: <lambda>() got multiple values for keyword argument 'allow_pickle'

我用这里的解决方案解决了这个问题:

On a jupyter notebook, using

np_load_old = np.load

# modify the default parameters of np.load
np.load = lambda *a,**k: np_load_old(*a, allow_pickle=True, **k)

worked fine, but the problem appears when you use this method in spyder (you have to restart the kernel every time), or you will get an error like:

TypeError: <lambda>() got multiple values for keyword argument 'allow_pickle'

I solved this issue using the solution here:


回答 9

我也碰到了这个问题,尝试了上面的方法,但没能解决。

我实际上是在一份现成的代码上工作,其中用到了

pickle.load(path)

所以我将其替换为

np.load(path, allow_pickle=True)

I landed up here, tried your ways, and could not figure it out.

I was actually working on pre-given code where

pickle.load(path)

was used, so I replaced it with

np.load(path, allow_pickle=True)

回答 10

是的,安装以前的numpy版本可以解决此问题。

对于使用PyCharm IDE的用户:

在我的IDE(Pycharm)中,依次单击File-> Settings-> Project Interpreter:我发现numpy为1.16.3,所以我恢复为1.16.1。单击+并在搜索中键入numpy,在“指定版本”上打勾:1.16.1并选择->安装软件包。

Yes, installing previous a version of numpy solved the problem.

For those who uses PyCharm IDE:

in my IDE (Pycharm), File->Settings->Project Interpreter: I found my numpy to be 1.16.3, so I revert back to 1.16.1. Click + and type numpy in the search, tick “specify version” : 1.16.1 and choose–> install package.


回答 11

找到imdb.py的路径,然后将标志添加到np.load(path,… flag …)

    def load_data(.......):
    .......................................
    .......................................
    - with np.load(path) as f:
    + with np.load(path,allow_pickle=True) as f:

find the path to imdb.py then just add the flag to np.load(path,…flag…)

    def load_data(.......):
    .......................................
    .......................................
    - with np.load(path) as f:
    + with np.load(path,allow_pickle=True) as f:

回答 12

这对我有效:

        import numpy as np
        from keras.datasets import reuters

        np_load_old = np.load
        np.load = lambda *a: np_load_old(*a, allow_pickle=True)
        (x_train, y_train), (x_test, y_test) = reuters.load_data(num_words=None, test_split=0.2)
        np.load = np_load_old

It works for me:

        import numpy as np
        from keras.datasets import reuters

        np_load_old = np.load
        np.load = lambda *a: np_load_old(*a, allow_pickle=True)
        (x_train, y_train), (x_test, y_test) = reuters.load_data(num_words=None, test_split=0.2)
        np.load = np_load_old

回答 13

我发现TensorFlow 2.0(我正在使用2.0.0-alpha0)与最新版本的Numpy不兼容,即v1.17.0(可能还有v1.16.5 +)。导入TF2后,它会抛出巨大的FutureWarning列表,如下所示:

FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint8 = np.dtype([("qint8", np.int8, 1)])
/anaconda3/lib/python3.6/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:541: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint8 = np.dtype([("qint8", np.int8, 1)])
/anaconda3/lib/python3.6/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:542: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_quint8 = np.dtype([("quint8", np.uint8, 1)])
/anaconda3/lib/python3.6/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:543: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.

尝试从keras加载imdb数据集时,这还会导致allow_pickle错误

我尝试使用下面这个很有效的解决方案,但每个导入 TF2 或 tf.keras 的项目里我都得这么做一遍。

np_load_old = np.load
np.load = lambda *a,**k: np_load_old(*a, allow_pickle=True, **k)

我找到的最简单的解决方案是全局安装numpy 1.16.1,或者在虚拟环境中使用tensorflow和numpy的兼容版本。

我写这个答案是想指出:这不仅仅是 imdb.load_data 的问题,而是 TF2 与 Numpy 版本不兼容所引起的更大问题,它还可能导致许多其他隐藏的错误或问题。

What I have found is that TensorFlow 2.0 (I am using 2.0.0-alpha0) is not compatible with the latest version of Numpy i.e. v1.17.0 (and possibly v1.16.5+). As soon as TF2 is imported, it throws a huge list of FutureWarning, that looks something like this:

FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint8 = np.dtype([("qint8", np.int8, 1)])
/anaconda3/lib/python3.6/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:541: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint8 = np.dtype([("qint8", np.int8, 1)])
/anaconda3/lib/python3.6/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:542: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_quint8 = np.dtype([("quint8", np.uint8, 1)])
/anaconda3/lib/python3.6/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:543: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.

This also resulted in the allow_pickle error when tried to load imdb dataset from keras

I tried to use the following solution which worked just fine, but I had to do it every single project where I was importing TF2 or tf.keras.

np_load_old = np.load
np.load = lambda *a,**k: np_load_old(*a, allow_pickle=True, **k)

The easiest solution I found was to either install numpy 1.16.1 globally, or use compatible versions of tensorflow and numpy in a virtual environment.

My goal with this answer is to point out that it’s not just a problem with imdb.load_data, but a larger problem caused by incompatibility of TF2 and Numpy versions that may result in many other hidden bugs or issues.


回答 14

Tensorflow在tf-nightly版本中有一个修复。

!pip install tf-nightly

当前版本是’2.0.0-dev20190511’。

Tensorflow has a fix in tf-nightly version.

!pip install tf-nightly

The current version is ‘2.0.0-dev20190511’.


回答 15

@cheez 的答案有时不起作用,函数会一次又一次地递归调用自身。要解决此问题,您应该对该函数做一份独立的拷贝。您可以使用 partial 函数来做到这一点,因此最终代码为:

import numpy as np
from functools import partial

# save np.load
np_load_old = partial(np.load)

# modify the default parameters of np.load
np.load = lambda *a,**k: np_load_old(*a, allow_pickle=True, **k)

# call load_data with allow_pickle implicitly set to true
(train_data, train_labels), (test_data, test_labels) = \
    imdb.load_data(num_words=10000)

# restore np.load for future normal usage
np.load = np_load_old

The answer of @cheez sometimes doesn’t work and recursively calls the function again and again. To solve this problem you should make an independent copy of the function. You can do this by using functools.partial, so the final code is:

import numpy as np
from functools import partial

# save np.load
np_load_old = partial(np.load)

# modify the default parameters of np.load
np.load = lambda *a,**k: np_load_old(*a, allow_pickle=True, **k)

# call load_data with allow_pickle implicitly set to true
(train_data, train_labels), (test_data, test_labels) = \
    imdb.load_data(num_words=10000)

# restore np.load for future normal usage
np.load = np_load_old

回答 16

我通常不发这类帖子,但这个问题实在太烦人了。造成混淆的原因是:某些 Keras 的 imdb.py 文件已经把

with np.load(path) as f:

更新成了带 allow_pickle=True 的版本。请检查您的 imdb.py 文件,确认是否已经做了此更改。如果已经调整过,以下代码就可以正常工作:

from keras.datasets import imdb
(train_text, train_labels), (test_text, test_labels) = imdb.load_data(num_words=10000)

I don’t usually post to these things but this was super annoying. The confusion comes from the fact that some of the Keras imdb.py files have already been updated from:

with np.load(path) as f:

to the version with allow_pickle=True. Make sure to check the imdb.py file to see if this change was already implemented. If it has been adjusted, the following works fine:

from keras.datasets import imdb
(train_text, train_labels), (test_text, test_labels) = imdb.load_data(num_words=10000)

回答 17

最简单的方法是在 imdb.py 中抛出错误的那一行,把 np.load 的设置改为 allow_pickle=True。

The easiest way is to change imdb.py setting allow_pickle=True to np.load at the line where imdb.py throws error.


如何将RGB图像转换为numpy数组?

问题:如何将RGB图像转换为numpy数组?

我有RGB图像。我想将其转换为numpy数组。我做了以下

im = cv.LoadImage("abc.tiff")
a = numpy.asarray(im)

它创建一个没有形状的数组。我假设它是一个iplimage对象。

I have an RGB image. I want to convert it to numpy array. I did the following

im = cv.LoadImage("abc.tiff")
a = numpy.asarray(im)

It creates an array with no shape. I assume it is a iplimage object.


回答 0

您可以使用较新的OpenCV python接口(如果我没记错的话,自OpenCV 2.2起就可以使用)。它本机使用numpy数组:

import cv2
im = cv2.imread("abc.tiff")  # loads the image as a numpy array (BGR channel order)
print type(im)

结果:

<type 'numpy.ndarray'>

You can use newer OpenCV python interface (if I’m not mistaken it is available since OpenCV 2.2). It natively uses numpy arrays:

import cv2
im = cv2.imread("abc.tiff")  # loads the image as a numpy array (BGR channel order)
print type(im)

result:

<type 'numpy.ndarray'>

回答 1

PIL(Python影像库)和Numpy可以很好地协同工作。

我使用以下函数。

from PIL import Image
import numpy as np

def load_image( infilename ) :
    img = Image.open( infilename )
    img.load()
    data = np.asarray( img, dtype="int32" )
    return data

def save_image( npdata, outfilename ) :
    img = Image.fromarray( np.asarray( np.clip(npdata,0,255), dtype="uint8"), "L" )
    img.save( outfilename )

“Image.fromarray” 这里写得有点难看,因为我把传入的数据裁剪到 [0,255]、转换为字节,然后创建灰度图像。我大部分时间处理的都是灰度图。

RGB图像如下所示:

 outimg = Image.fromarray( ycc_uint8, "RGB" )
 outimg.save( "ycc.tif" )

PIL (Python Imaging Library) and Numpy work well together.

I use the following functions.

from PIL import Image
import numpy as np

def load_image( infilename ) :
    img = Image.open( infilename )
    img.load()
    data = np.asarray( img, dtype="int32" )
    return data

def save_image( npdata, outfilename ) :
    img = Image.fromarray( np.asarray( np.clip(npdata,0,255), dtype="uint8"), "L" )
    img.save( outfilename )

The ‘Image.fromarray’ is a little ugly because I clip incoming data to [0,255], convert to bytes, then create a grayscale image. I mostly work in gray.

An RGB image would be something like:

 outimg = Image.fromarray( ycc_uint8, "RGB" )
 outimg.save( "ycc.tif" )
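
A quick usage sketch of the two helpers above (the filenames are hypothetical, and the input is assumed to be a grayscale image so that the "L"-mode save applies):

data = load_image("gray.tiff")           # 2-D int32 array
save_image(255 - data, "inverted.tiff")  # invert the values and write back out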

回答 2

您也可以为此使用matplotlib

from matplotlib.image import imread

img = imread('abc.tiff')
print(type(img))

输出: <class 'numpy.ndarray'>

You can also use matplotlib for this.

from matplotlib.image import imread

img = imread('abc.tiff')
print(type(img))

output: <class 'numpy.ndarray'>


回答 3

截至今天,您最好的选择是使用:

img = cv2.imread(image_path)   # reads an image in the BGR format
img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)   # BGR -> RGB

您会看到 img 是一个 numpy 数组,类型为:

<class 'numpy.ndarray'>

As of today, your best bet is to use:

img = cv2.imread(image_path)   # reads an image in the BGR format
img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)   # BGR -> RGB

You’ll see img will be a numpy array of type:

<class 'numpy.ndarray'>

回答 4

回答得比较晚,但与其他替代方案相比,我更喜欢 imageio 模块

import imageio
im = imageio.imread('abc.tiff')

与 cv2.imread() 类似,它默认生成 numpy 数组,但通道顺序是 RGB。

Late answer, but I’ve come to prefer the imageio module to the other alternatives

import imageio
im = imageio.imread('abc.tiff')

Similar to cv2.imread(), it produces a numpy array by default, but in RGB form.


回答 5

您需要使用cv.LoadImageM而不是cv.LoadImage:

In [1]: import cv
In [2]: import numpy as np
In [3]: x = cv.LoadImageM('im.tif')
In [4]: im = np.asarray(x)
In [5]: im.shape
Out[5]: (487, 650, 3)

You need to use cv.LoadImageM instead of cv.LoadImage:

In [1]: import cv
In [2]: import numpy as np
In [3]: x = cv.LoadImageM('im.tif')
In [4]: im = np.asarray(x)
In [5]: im.shape
Out[5]: (487, 650, 3)

回答 6

当使用 David Poole 的答案时,灰度 PNG(也许还有其他文件)会出现 SystemError。我的解决方案是:

import numpy as np
from PIL import Image

img = Image.open( filename )
try:
    data = np.asarray( img, dtype='uint8' )
except SystemError:
    data = np.asarray( img.getdata(), dtype='uint8' )

实际上img.getdata()适用于所有文件,但速度较慢,因此仅在其他方法失败时才使用它。

When using the answer from David Poole I get a SystemError with gray scale PNGs and maybe other files. My solution is:

import numpy as np
from PIL import Image

img = Image.open( filename )
try:
    data = np.asarray( img, dtype='uint8' )
except SystemError:
    data = np.asarray( img.getdata(), dtype='uint8' )

Actually img.getdata() would work for all files, but it’s slower, so I use it only when the other method fails.


回答 7

OpenCV 图像格式支持 numpy 数组接口。可以编写一个辅助函数,同时支持灰度或彩色图像。这意味着 BGR -> RGB 转换可以方便地用一个 numpy 切片完成,而不必对图像数据做完整拷贝。

注意:这是一个步幅(stride)技巧,因此修改输出数组也会更改 OpenCV 的图像数据。如果需要副本,请对数组使用 .copy() 方法!

import numpy as np

def img_as_array(im):
    """OpenCV's native format to a numpy array view"""
    w, h, n = im.width, im.height, im.channels
    modes = {1: "L", 3: "RGB", 4: "RGBA"}
    if n not in modes:
        raise Exception('unsupported number of channels: {0}'.format(n))
    out = np.asarray(im)
    if n != 1:
        out = out[:, :, ::-1]  # BGR -> RGB conversion
    return out

OpenCV image format supports the numpy array interface. A helper function can be made to support either grayscale or color images. This means the BGR -> RGB conversion can be conveniently done with a numpy slice, not a full copy of image data.

Note: this is a stride trick, so modifying the output array will also change the OpenCV image data. If you want a copy, use .copy() method on the array!

import numpy as np

def img_as_array(im):
    """OpenCV's native format to a numpy array view"""
    w, h, n = im.width, im.height, im.channels
    modes = {1: "L", 3: "RGB", 4: "RGBA"}
    if n not in modes:
        raise Exception('unsupported number of channels: {0}'.format(n))
    out = np.asarray(im)
    if n != 1:
        out = out[:, :, ::-1]  # BGR -> RGB conversion
    return out

回答 8

我也采用了 imageio,但我发现下面这套辅助代码对预处理和后处理很有用:

import imageio
import numpy as np

def imload(*a, **k):
    i = imageio.imread(*a, **k)
    i = i.transpose((1, 0, 2))  # x and y are mixed up for some reason...
    i = np.flip(i, 1)  # make coordinate system right-handed!!!!!!
    return i/255


def imsave(i, url, *a, **k):
    # Original order of arguments was counterintuitive. It should
    # read verbally "Save the image to the URL" — not "Save to the
    # URL the image."

    i = np.flip(i, 1)
    i = i.transpose((1, 0, 2))
    i *= 255

    i = i.round()
    i = np.maximum(i, 0)
    i = np.minimum(i, 255)

    i = np.asarray(i, dtype=np.uint8)

    imageio.imwrite(url, i, *a, **k)

原因是我使用 numpy 做图像处理,而不仅仅是图像显示。为此,uint8 很不方便,所以我将其转换为取值从 0 到 1 的浮点值。

保存图像时,我注意到必须自己裁掉超出范围的值,否则最终会得到一片灰蒙蒙的输出。(灰色输出是 imageio 把超出 [0, 256) 的完整取值范围压缩到范围之内造成的。)

我在注释中也提到了其他一些奇怪之处。

I also adopted imageio, but I found the following machinery useful for pre- and post-processing:

import imageio
import numpy as np

def imload(*a, **k):
    i = imageio.imread(*a, **k)
    i = i.transpose((1, 0, 2))  # x and y are mixed up for some reason...
    i = np.flip(i, 1)  # make coordinate system right-handed!!!!!!
    return i/255


def imsave(i, url, *a, **k):
    # Original order of arguments was counterintuitive. It should
    # read verbally "Save the image to the URL" — not "Save to the
    # URL the image."

    i = np.flip(i, 1)
    i = i.transpose((1, 0, 2))
    i *= 255

    i = i.round()
    i = np.maximum(i, 0)
    i = np.minimum(i, 255)

    i = np.asarray(i, dtype=np.uint8)

    imageio.imwrite(url, i, *a, **k)

The rationale is that I am using numpy for image processing, not just image displaying. For this purpose, uint8s are awkward, so I convert to floating point values ranging from 0 to 1.

When saving images, I noticed I had to cut the out-of-range values myself, or else I ended up with a really gray output. (The gray output was the result of imageio compressing the full range, which was outside of [0, 256), to values that were inside the range.)

There were a couple other oddities, too, which I mentioned in the comments.


回答 9

您可以使用 numpy 和 PIL 的 Image 轻松得到 RGB 图片的 numpy 数组

import numpy as np
from PIL import Image
import matplotlib.pyplot as plt

im = Image.open('*image_name*') #These two lines
im_arr = np.array(im) #are all you need
plt.imshow(im_arr) #Just to verify that image array has been constructed properly

You can get numpy array of rgb image easily by using numpy and Image from PIL

import numpy as np
from PIL import Image
import matplotlib.pyplot as plt

im = Image.open('*image_name*') #These two lines
im_arr = np.array(im) #are all you need
plt.imshow(im_arr) #Just to verify that image array has been constructed properly

回答 10

使用以下语法加载图像:

from keras.preprocessing import image

X_test = image.load_img('four.png', target_size=(28,28), color_mode="grayscale")  # load the image, convert it to grayscale, and resize to the target size
X_test = image.img_to_array(X_test)  # convert the image into an array

Load the image by using the following syntax:

from keras.preprocessing import image

X_test = image.load_img('four.png', target_size=(28,28), color_mode="grayscale")  # load the image, convert it to grayscale, and resize to the target size
X_test = image.img_to_array(X_test)  # convert the image into an array

Python中的主成分分析

问题:Python中的主成分分析

我想使用主成分分析(PCA)进行降维。numpy 或 scipy 是否已经提供了它,还是我必须用 numpy.linalg.eigh 自己实现?

我不想只用奇异值分解(SVD),因为我的输入数据维度相当高(约 460 维),所以我认为 SVD 会比计算协方差矩阵的特征向量更慢。

我希望找到一个现成的、已调试过的实现,它已经就何时使用哪种方法做出了正确的决策,或许还做了一些我不知道的其他优化。

I’d like to use principal component analysis (PCA) for dimensionality reduction. Does numpy or scipy already have it, or do I have to roll my own using numpy.linalg.eigh?

I don’t just want to use singular value decomposition (SVD) because my input data are quite high-dimensional (~460 dimensions), so I think SVD will be slower than computing the eigenvectors of the covariance matrix.

I was hoping to find a premade, debugged implementation that already makes the right decisions for when to use which method, and which maybe does other optimizations that I don’t know about.


回答 0

您可以看看MDP

我没有机会亲自测试它,但我正是为了其中的 PCA 功能才把它加入书签的。

You might have a look at MDP.

I have not had the chance to test it myself, but I’ve bookmarked it exactly for the PCA functionality.


回答 1

几个月后:这是一个小小的 PCA 类,外加一张图:

#!/usr/bin/env python
""" a small class for Principal Component Analysis
Usage:
    p = PCA( A, fraction=0.90 )
In:
    A: an array of e.g. 1000 observations x 20 variables, 1000 rows x 20 columns
    fraction: use principal components that account for e.g.
        90 % of the total variance

Out:
    p.U, p.d, p.Vt: from numpy.linalg.svd, A = U . d . Vt
    p.dinv: 1/d or 0, see NR
    p.eigen: the eigenvalues of A*A, in decreasing order (p.d**2).
        eigen[j] / eigen.sum() is variable j's fraction of the total variance;
        look at the first few eigen[] to see how many PCs get to 90 %, 95 % ...
    p.npc: number of principal components,
        e.g. 2 if the top 2 eigenvalues are >= `fraction` of the total.
        It's ok to change this; methods use the current value.

Methods:
    The methods of class PCA transform vectors or arrays of e.g.
    20 variables, 2 principal components and 1000 observations,
    using partial matrices U' d' Vt', parts of the full U d Vt:
    A ~ U' . d' . Vt' where e.g.
        U' is 1000 x 2
        d' is diag([ d0, d1 ]), the 2 largest singular values
        Vt' is 2 x 20.  Dropping the primes,

    d . Vt      2 principal vars = p.vars_pc( 20 vars )
    U           1000 obs = p.pc_obs( 2 principal vars )
    U . d . Vt  1000 obs, p.obs( 20 vars ) = pc_obs( vars_pc( vars ))
        fast approximate A . vars, using the `npc` principal components

    Ut              2 pcs = p.obs_pc( 1000 obs )
    V . dinv        20 vars = p.pc_vars( 2 principal vars )
    V . dinv . Ut   20 vars, p.vars( 1000 obs ) = pc_vars( obs_pc( obs )),
        fast approximate Ainverse . obs: vars that give ~ those obs.


Notes:
    PCA does not center or scale A; you usually want to first
        A -= A.mean(A, axis=0)
        A /= A.std(A, axis=0)
    with the little class Center or the like, below.

See also:
    http://en.wikipedia.org/wiki/Principal_component_analysis
    http://en.wikipedia.org/wiki/Singular_value_decomposition
    Press et al., Numerical Recipes (2 or 3 ed), SVD
    PCA micro-tutorial
    iris-pca .py .png

"""

from __future__ import division
import numpy as np
dot = np.dot
    # import bz.numpyutil as nu
    # dot = nu.pdot

__version__ = "2010-04-14 apr"
__author_email__ = "denis-bz-py at t-online dot de"

#...............................................................................
class PCA:
    def __init__( self, A, fraction=0.90 ):
        assert 0 <= fraction <= 1
            # A = U . diag(d) . Vt, O( m n^2 ), lapack_lite --
        self.U, self.d, self.Vt = np.linalg.svd( A, full_matrices=False )
        assert np.all( self.d[:-1] >= self.d[1:] )  # sorted
        self.eigen = self.d**2
        self.sumvariance = np.cumsum(self.eigen)
        self.sumvariance /= self.sumvariance[-1]
        self.npc = np.searchsorted( self.sumvariance, fraction ) + 1
        self.dinv = np.array([ 1/d if d > self.d[0] * 1e-6  else 0
                                for d in self.d ])

    def pc( self ):
        """ e.g. 1000 x 2 U[:, :npc] * d[:npc], to plot etc. """
        n = self.npc
        return self.U[:, :n] * self.d[:n]

    # These 1-line methods may not be worth the bother;
    # then use U d Vt directly --

    def vars_pc( self, x ):
        n = self.npc
        return self.d[:n] * dot( self.Vt[:n], x.T ).T  # 20 vars -> 2 principal

    def pc_vars( self, p ):
        n = self.npc
        return dot( self.Vt[:n].T, (self.dinv[:n] * p).T ) .T  # 2 PC -> 20 vars

    def pc_obs( self, p ):
        n = self.npc
        return dot( self.U[:, :n], p.T )  # 2 principal -> 1000 obs

    def obs_pc( self, obs ):
        n = self.npc
        return dot( self.U[:, :n].T, obs ) .T  # 1000 obs -> 2 principal

    def obs( self, x ):
        return self.pc_obs( self.vars_pc(x) )  # 20 vars -> 2 principal -> 1000 obs

    def vars( self, obs ):
        return self.pc_vars( self.obs_pc(obs) )  # 1000 obs -> 2 principal -> 20 vars


class Center:
    """ A -= A.mean() /= A.std(), inplace -- use A.copy() if need be
        uncenter(x) == original A . x
    """
        # mttiw
    def __init__( self, A, axis=0, scale=True, verbose=1 ):
        self.mean = A.mean(axis=axis)
        if verbose:
            print "Center -= A.mean:", self.mean
        A -= self.mean
        if scale:
            std = A.std(axis=axis)
            self.std = np.where( std, std, 1. )
            if verbose:
                print "Center /= A.std:", self.std
            A /= self.std
        else:
            self.std = np.ones( A.shape[-1] )
        self.A = A

    def uncenter( self, x ):
        return np.dot( self.A, x * self.std ) + np.dot( x, self.mean )


#...............................................................................
if __name__ == "__main__":
    import sys

    csv = "iris4.csv"  # wikipedia Iris_flower_data_set
        # 5.1,3.5,1.4,0.2  # ,Iris-setosa ...
    N = 1000
    K = 20
    fraction = .90
    seed = 1
    exec "\n".join( sys.argv[1:] )  # N= ...
    np.random.seed(seed)
    np.set_printoptions( 1, threshold=100, suppress=True )  # .1f
    try:
        A = np.genfromtxt( csv, delimiter="," )
        N, K = A.shape
    except IOError:
        A = np.random.normal( size=(N, K) )  # gen correlated ?

    print "csv: %s  N: %d  K: %d  fraction: %.2g" % (csv, N, K, fraction)
    Center(A)
    print "A:", A

    print "PCA ..." ,
    p = PCA( A, fraction=fraction )
    print "npc:", p.npc
    print "% variance:", p.sumvariance * 100

    print "Vt[0], weights that give PC 0:", p.Vt[0]
    print "A . Vt[0]:", dot( A, p.Vt[0] )
    print "pc:", p.pc()

    print "\nobs <-> pc <-> x: with fraction=1, diffs should be ~ 0"
    x = np.ones(K)
    # x = np.ones(( 3, K ))
    print "x:", x
    pc = p.vars_pc(x)  # d' Vt' x
    print "vars_pc(x):", pc
    print "back to ~ x:", p.pc_vars(pc)

    Ax = dot( A, x.T )
    pcx = p.obs(x)  # U' d' Vt' x
    print "Ax:", Ax
    print "A'x:", pcx
    print "max |Ax - A'x|: %.2g" % np.linalg.norm( Ax - pcx, np.inf )

    b = Ax  # ~ back to original x, Ainv A x
    back = p.vars(b)
    print "~ back again:", back
    print "max |back - x|: %.2g" % np.linalg.norm( back - x, np.inf )

# end pca.py

[图:iris-pca 输出图]

Months later, here’s a small class PCA, and a picture:

#!/usr/bin/env python
""" a small class for Principal Component Analysis
Usage:
    p = PCA( A, fraction=0.90 )
In:
    A: an array of e.g. 1000 observations x 20 variables, 1000 rows x 20 columns
    fraction: use principal components that account for e.g.
        90 % of the total variance

Out:
    p.U, p.d, p.Vt: from numpy.linalg.svd, A = U . d . Vt
    p.dinv: 1/d or 0, see NR
    p.eigen: the eigenvalues of A*A, in decreasing order (p.d**2).
        eigen[j] / eigen.sum() is variable j's fraction of the total variance;
        look at the first few eigen[] to see how many PCs get to 90 %, 95 % ...
    p.npc: number of principal components,
        e.g. 2 if the top 2 eigenvalues are >= `fraction` of the total.
        It's ok to change this; methods use the current value.

Methods:
    The methods of class PCA transform vectors or arrays of e.g.
    20 variables, 2 principal components and 1000 observations,
    using partial matrices U' d' Vt', parts of the full U d Vt:
    A ~ U' . d' . Vt' where e.g.
        U' is 1000 x 2
        d' is diag([ d0, d1 ]), the 2 largest singular values
        Vt' is 2 x 20.  Dropping the primes,

    d . Vt      2 principal vars = p.vars_pc( 20 vars )
    U           1000 obs = p.pc_obs( 2 principal vars )
    U . d . Vt  1000 obs, p.obs( 20 vars ) = pc_obs( vars_pc( vars ))
        fast approximate A . vars, using the `npc` principal components

    Ut              2 pcs = p.obs_pc( 1000 obs )
    V . dinv        20 vars = p.pc_vars( 2 principal vars )
    V . dinv . Ut   20 vars, p.vars( 1000 obs ) = pc_vars( obs_pc( obs )),
        fast approximate Ainverse . obs: vars that give ~ those obs.


Notes:
    PCA does not center or scale A; you usually want to first
        A -= A.mean(A, axis=0)
        A /= A.std(A, axis=0)
    with the little class Center or the like, below.

See also:
    http://en.wikipedia.org/wiki/Principal_component_analysis
    http://en.wikipedia.org/wiki/Singular_value_decomposition
    Press et al., Numerical Recipes (2 or 3 ed), SVD
    PCA micro-tutorial
    iris-pca .py .png

"""

from __future__ import division
import numpy as np
dot = np.dot
    # import bz.numpyutil as nu
    # dot = nu.pdot

__version__ = "2010-04-14 apr"
__author_email__ = "denis-bz-py at t-online dot de"

#...............................................................................
class PCA:
    def __init__( self, A, fraction=0.90 ):
        assert 0 <= fraction <= 1
            # A = U . diag(d) . Vt, O( m n^2 ), lapack_lite --
        self.U, self.d, self.Vt = np.linalg.svd( A, full_matrices=False )
        assert np.all( self.d[:-1] >= self.d[1:] )  # sorted
        self.eigen = self.d**2
        self.sumvariance = np.cumsum(self.eigen)
        self.sumvariance /= self.sumvariance[-1]
        self.npc = np.searchsorted( self.sumvariance, fraction ) + 1
        self.dinv = np.array([ 1/d if d > self.d[0] * 1e-6  else 0
                                for d in self.d ])

    def pc( self ):
        """ e.g. 1000 x 2 U[:, :npc] * d[:npc], to plot etc. """
        n = self.npc
        return self.U[:, :n] * self.d[:n]

    # These 1-line methods may not be worth the bother;
    # then use U d Vt directly --

    def vars_pc( self, x ):
        n = self.npc
        return self.d[:n] * dot( self.Vt[:n], x.T ).T  # 20 vars -> 2 principal

    def pc_vars( self, p ):
        n = self.npc
        return dot( self.Vt[:n].T, (self.dinv[:n] * p).T ) .T  # 2 PC -> 20 vars

    def pc_obs( self, p ):
        n = self.npc
        return dot( self.U[:, :n], p.T )  # 2 principal -> 1000 obs

    def obs_pc( self, obs ):
        n = self.npc
        return dot( self.U[:, :n].T, obs ) .T  # 1000 obs -> 2 principal

    def obs( self, x ):
        return self.pc_obs( self.vars_pc(x) )  # 20 vars -> 2 principal -> 1000 obs

    def vars( self, obs ):
        return self.pc_vars( self.obs_pc(obs) )  # 1000 obs -> 2 principal -> 20 vars


class Center:
    """ A -= A.mean() /= A.std(), inplace -- use A.copy() if need be
        uncenter(x) == original A . x
    """
        # mttiw
    def __init__( self, A, axis=0, scale=True, verbose=1 ):
        self.mean = A.mean(axis=axis)
        if verbose:
            print "Center -= A.mean:", self.mean
        A -= self.mean
        if scale:
            std = A.std(axis=axis)
            self.std = np.where( std, std, 1. )
            if verbose:
                print "Center /= A.std:", self.std
            A /= self.std
        else:
            self.std = np.ones( A.shape[-1] )
        self.A = A

    def uncenter( self, x ):
        return np.dot( self.A, x * self.std ) + np.dot( x, self.mean )


#...............................................................................
if __name__ == "__main__":
    import sys

    csv = "iris4.csv"  # wikipedia Iris_flower_data_set
        # 5.1,3.5,1.4,0.2  # ,Iris-setosa ...
    N = 1000
    K = 20
    fraction = .90
    seed = 1
    exec "\n".join( sys.argv[1:] )  # N= ...
    np.random.seed(seed)
    np.set_printoptions( 1, threshold=100, suppress=True )  # .1f
    try:
        A = np.genfromtxt( csv, delimiter="," )
        N, K = A.shape
    except IOError:
        A = np.random.normal( size=(N, K) )  # gen correlated ?

    print "csv: %s  N: %d  K: %d  fraction: %.2g" % (csv, N, K, fraction)
    Center(A)
    print "A:", A

    print "PCA ..." ,
    p = PCA( A, fraction=fraction )
    print "npc:", p.npc
    print "% variance:", p.sumvariance * 100

    print "Vt[0], weights that give PC 0:", p.Vt[0]
    print "A . Vt[0]:", dot( A, p.Vt[0] )
    print "pc:", p.pc()

    print "\nobs <-> pc <-> x: with fraction=1, diffs should be ~ 0"
    x = np.ones(K)
    # x = np.ones(( 3, K ))
    print "x:", x
    pc = p.vars_pc(x)  # d' Vt' x
    print "vars_pc(x):", pc
    print "back to ~ x:", p.pc_vars(pc)

    Ax = dot( A, x.T )
    pcx = p.obs(x)  # U' d' Vt' x
    print "Ax:", Ax
    print "A'x:", pcx
    print "max |Ax - A'x|: %.2g" % np.linalg.norm( Ax - pcx, np.inf )

    b = Ax  # ~ back to original x, Ainv A x
    back = p.vars(b)
    print "~ back again:", back
    print "max |back - x|: %.2g" % np.linalg.norm( back - x, np.inf )

# end pca.py

[figure: iris-pca plot output]


回答 2

用 numpy.linalg.svd 做 PCA 非常容易。这是一个简单的演示:

import numpy as np
import matplotlib.pyplot as plt
from scipy.misc import lena

# the underlying signal is a sinusoidally modulated image
img = lena()
t = np.arange(100)
time = np.sin(0.1*t)
real = time[:,np.newaxis,np.newaxis] * img[np.newaxis,...]

# we add some noise
noisy = real + np.random.randn(*real.shape)*255

# (observations, features) matrix
M = noisy.reshape(noisy.shape[0],-1)

# singular value decomposition factorises your data matrix such that:
# 
#   M = U*S*V.T     (where '*' is matrix multiplication)
# 
# * U and V are the singular matrices, containing orthogonal vectors of
#   unit length in their rows and columns respectively.
#
# * S is a diagonal matrix containing the singular values of M - these 
#   values squared divided by the number of observations will give the 
#   variance explained by each PC.
#
# * if M is considered to be an (observations, features) matrix, the PCs
#   themselves would correspond to the rows of S^(1/2)*V.T. if M is 
#   (features, observations) then the PCs would be the columns of
#   U*S^(1/2).
#
# * since U and V both contain orthonormal vectors, U*V.T is equivalent 
#   to a whitened version of M.

U, s, Vt = np.linalg.svd(M, full_matrices=False)
V = Vt.T

# PCs are already sorted by descending order 
# of the singular values (i.e. by the
# proportion of total variance they explain)

# if we use all of the PCs we can reconstruct the noisy signal perfectly
S = np.diag(s)
Mhat = np.dot(U, np.dot(S, V.T))
print "Using all PCs, MSE = %.6G" %(np.mean((M - Mhat)**2))

# if we use only the first 20 PCs the reconstruction is less accurate
Mhat2 = np.dot(U[:, :20], np.dot(S[:20, :20], V[:,:20].T))
print "Using first 20 PCs, MSE = %.6G" %(np.mean((M - Mhat2)**2))

fig, [ax1, ax2, ax3] = plt.subplots(1, 3)
ax1.imshow(img)
ax1.set_title('true image')
ax2.imshow(noisy.mean(0))
ax2.set_title('mean of noisy images')
ax3.imshow((s[0]**(1./2) * V[:,0]).reshape(img.shape))
ax3.set_title('first spatial PC')
plt.show()

PCA using numpy.linalg.svd is super easy. Here’s a simple demo:

import numpy as np
import matplotlib.pyplot as plt
from scipy.misc import lena

# the underlying signal is a sinusoidally modulated image
img = lena()
t = np.arange(100)
time = np.sin(0.1*t)
real = time[:,np.newaxis,np.newaxis] * img[np.newaxis,...]

# we add some noise
noisy = real + np.random.randn(*real.shape)*255

# (observations, features) matrix
M = noisy.reshape(noisy.shape[0],-1)

# singular value decomposition factorises your data matrix such that:
# 
#   M = U*S*V.T     (where '*' is matrix multiplication)
# 
# * U and V are the singular matrices, containing orthogonal vectors of
#   unit length in their rows and columns respectively.
#
# * S is a diagonal matrix containing the singular values of M - these 
#   values squared divided by the number of observations will give the 
#   variance explained by each PC.
#
# * if M is considered to be an (observations, features) matrix, the PCs
#   themselves would correspond to the rows of S^(1/2)*V.T. if M is 
#   (features, observations) then the PCs would be the columns of
#   U*S^(1/2).
#
# * since U and V both contain orthonormal vectors, U*V.T is equivalent 
#   to a whitened version of M.

U, s, Vt = np.linalg.svd(M, full_matrices=False)
V = Vt.T

# PCs are already sorted by descending order 
# of the singular values (i.e. by the
# proportion of total variance they explain)

# if we use all of the PCs we can reconstruct the noisy signal perfectly
S = np.diag(s)
Mhat = np.dot(U, np.dot(S, V.T))
print "Using all PCs, MSE = %.6G" %(np.mean((M - Mhat)**2))

# if we use only the first 20 PCs the reconstruction is less accurate
Mhat2 = np.dot(U[:, :20], np.dot(S[:20, :20], V[:,:20].T))
print "Using first 20 PCs, MSE = %.6G" %(np.mean((M - Mhat2)**2))

fig, [ax1, ax2, ax3] = plt.subplots(1, 3)
ax1.imshow(img)
ax1.set_title('true image')
ax2.imshow(noisy.mean(0))
ax2.set_title('mean of noisy images')
ax3.imshow((s[0]**(1./2) * V[:,0]).reshape(img.shape))
ax3.set_title('first spatial PC')
plt.show()

回答 3

您可以使用sklearn:

import sklearn.decomposition as deco
import numpy as np

x = (x - np.mean(x, 0)) / np.std(x, 0) # You need to normalize your data first
pca = deco.PCA(n_components) # n_components is the components number after reduction
x_r = pca.fit(x).transform(x)
print ('explained variance (first %d components): %.2f'%(n_components, sum(pca.explained_variance_ratio_)))

You can use sklearn:

import sklearn.decomposition as deco
import numpy as np

x = (x - np.mean(x, 0)) / np.std(x, 0) # You need to normalize your data first
pca = deco.PCA(n_components) # n_components is the components number after reduction
x_r = pca.fit(x).transform(x)
print ('explained variance (first %d components): %.2f'%(n_components, sum(pca.explained_variance_ratio_)))
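
For a self-contained version of the same recipe (the random data and the choice of n_components here are placeholders):

import numpy as np
import sklearn.decomposition as deco

np.random.seed(0)
x = np.random.randn(1000, 460)            # 1000 observations, 460 dimensions

n_components = 10
x = (x - np.mean(x, 0)) / np.std(x, 0)    # normalize first, as above
pca = deco.PCA(n_components)
x_r = pca.fit(x).transform(x)

print(x_r.shape)                          # (1000, 10)
print(sum(pca.explained_variance_ratio_)) # fraction of variance kept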


回答 5

SVD 在 460 维上应该可以正常工作。在我的 Atom 上网本上大约需要 7 秒。eig() 方法花费更多时间(理应如此,它使用更多的浮点运算),而且几乎总是精度更低。

如果您的样本少于 460 个,那么您要做的是对角化散布矩阵 (x - datamean)^T (x - datamean)(假设数据点是列),然后左乘 (x - datamean)。在维数多于数据点的情况下,这可能会更快。

SVD should work fine with 460 dimensions. It takes about 7 seconds on my Atom netbook. The eig() method takes more time (as it should, it uses more floating point operations) and will almost always be less accurate.

If you have less than 460 examples then what you want to do is diagonalize the scatter matrix (x – datamean)^T (x – datamean), assuming your data points are columns, and then left-multiply by (x – datamean). That might be faster in the case where you have more dimensions than data.
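
A sketch of that trick in code (the shapes are illustrative): with n samples as columns and d > n, you eigendecompose the small n-by-n matrix and map its eigenvectors back up by left-multiplying with the centered data:

import numpy as np

rng = np.random.RandomState(0)
X = rng.randn(460, 100)                  # d=460 dimensions, n=100 samples as columns
Xc = X - X.mean(axis=1, keepdims=True)   # subtract the data mean

G = Xc.T.dot(Xc)                         # n x n scatter matrix instead of d x d
evals, V = np.linalg.eigh(G)             # symmetric, so eigh applies
order = np.argsort(evals)[::-1]          # largest eigenvalues first
evals, V = evals[order], V[:, order]

k = 10                                   # keep the top k components
U = Xc.dot(V[:, :k])                     # left-multiply: eigenvectors of Xc.dot(Xc.T)
U /= np.linalg.norm(U, axis=0)           # normalize the columns to unit length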


回答 6

您可以很容易地用 scipy.linalg “自己动手”实现(假设数据集 data 已预先中心化):

import scipy.linalg

covmat = data.dot(data.T)
evs, evmat = scipy.linalg.eig(covmat)

然后evs是您的特征值,evmat就是您的投影矩阵。

如果要保留 d 个维度,请使用前 d 个特征值和前 d 个特征向量。

既然 scipy.linalg 提供了分解,numpy 提供了矩阵乘法,您还需要什么呢?

You can quite easily “roll” your own using scipy.linalg (assuming a pre-centered dataset data):

import scipy.linalg

covmat = data.dot(data.T)
evs, evmat = scipy.linalg.eig(covmat)

Then evs are your eigenvalues, and evmat is your projection matrix.

If you want to keep d dimensions, use the first d eigenvalues and first d eigenvectors.

Given that scipy.linalg has the decomposition and numpy the matrix multiplications, what else do you need?
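
One detail worth adding: scipy.linalg.eig does not return the eigenvalues sorted, so "the first d" requires ordering them yourself. A sketch of the selection and projection step, using eigh instead since covmat is symmetric (eigh returns real eigenvalues in ascending order):

import numpy as np
import scipy.linalg

rng = np.random.RandomState(1)
data = rng.randn(20, 500)                 # observations as columns
data -= data.mean(axis=1, keepdims=True)  # pre-center, as assumed above

covmat = data.dot(data.T)
evs, evmat = scipy.linalg.eigh(covmat)    # real eigenvalues, ascending

d = 2
proj = evmat[:, ::-1][:, :d]              # top-d eigenvectors, largest first
reduced = proj.T.dot(data)                # shape (d, n_samples)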


回答 7

我刚读完《机器学习:算法视角》(Machine Learning: An Algorithmic Perspective)一书。书中所有代码示例都是用 Python 编写的(而且几乎都用到了 Numpy)。第 10.2 章“主成分分析”的代码片段也许值得一读,它使用 numpy.linalg.eig。
顺便说一句,我认为 SVD 可以很好地处理 460×460 的维度。我曾在一台非常旧的 PC(Pentium III 733 MHz)上用 numpy/scipy.linalg.svd 计算过 6500×6500 的 SVD。老实说,脚本需要大量内存(约 1.x GB)和大量时间(约 30 分钟)才能得到 SVD 结果。但我认为,除非您需要执行海量次数的 SVD,否则在现代 PC 上 460×460 不会是什么大问题。

I just finished reading the book Machine Learning: An Algorithmic Perspective. All code examples in the book were written in Python (and almost all with Numpy). The code snippet for chapter 10.2, Principal Components Analysis, may be worth a read. It uses numpy.linalg.eig.
By the way, I think SVD can handle 460*460 dimensions very well. I have calculated a 6500*6500 SVD with numpy/scipy.linalg.svd on a very old PC: a Pentium III at 733 MHz. To be honest, the script needs a lot of memory (about 1.x GB) and a lot of time (about 30 minutes) to get the SVD result. But I think 460*460 on a modern PC will not be a big problem unless you need to do the SVD a huge number of times.


回答 8

您不需要完整的奇异值分解(SVD),因为它会计算所有特征值和特征向量,对大型矩阵来说代价可能高得难以接受。scipy 及其稀疏模块提供了同时适用于稀疏和稠密矩阵的通用线性代数函数,其中包括 eig* 系列函数:

http://docs.scipy.org/doc/scipy/reference/sparse.linalg.html#matrix-factorizations

Scikit-learn提供了Python PCA实现,目前仅支持密集矩阵。

计时:

In [1]: A = np.random.randn(1000, 1000)

In [2]: %timeit scipy.sparse.linalg.eigsh(A)
1 loops, best of 3: 802 ms per loop

In [3]: %timeit np.linalg.svd(A)
1 loops, best of 3: 5.91 s per loop

You do not need full Singular Value Decomposition (SVD), as it computes all eigenvalues and eigenvectors and can be prohibitive for large matrices. scipy and its sparse module provide generic linear algebra functions working on both sparse and dense matrices, among which there is the eig* family of functions:

http://docs.scipy.org/doc/scipy/reference/sparse.linalg.html#matrix-factorizations

Scikit-learn provides a Python PCA implementation which only support dense matrices for now.

Timings :

In [1]: A = np.random.randn(1000, 1000)

In [2]: %timeit scipy.sparse.linalg.eigsh(A)
1 loops, best of 3: 802 ms per loop

In [3]: %timeit np.linalg.svd(A)
1 loops, best of 3: 5.91 s per loop
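
To actually exploit this, ask eigsh for only the top k eigenpairs of a symmetric matrix rather than the full decomposition; a minimal sketch on a covariance matrix:

import numpy as np
import scipy.sparse.linalg

rng = np.random.RandomState(0)
X = rng.randn(1000, 460)
C = np.cov(X, rowvar=False)               # 460 x 460 symmetric covariance

# compute only the k largest-magnitude eigenpairs
evals, evecs = scipy.sparse.linalg.eigsh(C, k=6, which='LM')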

回答 9

这是另一个用 numpy、scipy 和 C 扩展实现的 Python PCA 模块。该模块使用 SVD 或用 C 实现的 NIPALS(非线性迭代偏最小二乘)算法来执行 PCA。

Here is another implementation of a PCA module for python using numpy, scipy and C-extensions. The module carries out PCA using either a SVD or the NIPALS (Nonlinear Iterative Partial Least Squares) algorithm which is implemented in C.


回答 10

如果您处理的是 3D 向量,可以使用工具库 vg 来简洁地应用 SVD。它是 numpy 之上的一个轻量层。

import numpy as np
import vg

vg.principal_components(data)

如果只需要第一个主成分,则还有一个方便的别名:

vg.major_axis(data)

我在上一家创业公司时创建了这个库,当时的动机正是这类用途:在 NumPy 中写起来冗长或晦涩的简单想法。

If you’re working with 3D vectors, you can apply SVD concisely using the toolbelt vg. It’s a light layer on top of numpy.

import numpy as np
import vg

vg.principal_components(data)

There’s also a convenient alias if you only want the first principal component:

vg.major_axis(data)

I created the library at my last startup, where it was motivated by uses like this: simple ideas which are verbose or opaque in NumPy.


索引python中除*某一项*之外的所有元素

问题:索引python中除*某一项*之外的所有元素

有没有一种简单的方法来索引列表(或数组,或其他任何东西)中特定索引之外的所有元素?例如,

  • mylist[3] 会返回位置 3 上的那个元素

  • mylist[~3] 会返回除位置 3 之外的整个列表

Is there a simple way to index all elements of a list (or array, or whatever) except for a particular index? E.g.,

  • mylist[3] will return the item in position 3

  • mylist[~3] will return the whole list except for 3


回答 0

对于列表,您可以使用列表推导式。例如,要让 b 成为 a 的一个去掉第 3 个元素的副本:

a = range(10)[::-1]                       # [9, 8, 7, 6, 5, 4, 3, 2, 1, 0]
b = [x for i,x in enumerate(a) if i!=3]   # [9, 8, 7, 5, 4, 3, 2, 1, 0]

这是非常通用的方法,可用于所有可迭代对象,包括 numpy 数组。如果把 [] 换成 (),b 将是一个迭代器,而不是列表。

或者,您可以用 pop 就地完成此操作:

a = range(10)[::-1]     # a = [9, 8, 7, 6, 5, 4, 3, 2, 1, 0]
a.pop(3)                # a = [9, 8, 7, 5, 4, 3, 2, 1, 0]

numpy中,您可以使用布尔索引来做到这一点:

a = np.arange(9, -1, -1)     # a = array([9, 8, 7, 6, 5, 4, 3, 2, 1, 0])
b = a[np.arange(len(a))!=3]  # b = array([9, 8, 7, 5, 4, 3, 2, 1, 0])

通常,这比上面列出的列表推导式要快得多。

For a list, you could use a list comp. For example, to make b a copy of a without the 3rd element:

a = range(10)[::-1]                       # [9, 8, 7, 6, 5, 4, 3, 2, 1, 0]
b = [x for i,x in enumerate(a) if i!=3]   # [9, 8, 7, 5, 4, 3, 2, 1, 0]

This is very general, and can be used with all iterables, including numpy arrays. If you replace [] with (), b will be an iterator instead of a list.

Or you could do this in-place with pop:

a = range(10)[::-1]     # a = [9, 8, 7, 6, 5, 4, 3, 2, 1, 0]
a.pop(3)                # a = [9, 8, 7, 5, 4, 3, 2, 1, 0]

In numpy you could do this with boolean indexing:

a = np.arange(9, -1, -1)     # a = array([9, 8, 7, 6, 5, 4, 3, 2, 1, 0])
b = a[np.arange(len(a))!=3]  # b = array([9, 8, 7, 5, 4, 3, 2, 1, 0])

which will, in general, be much faster than the list comprehension listed above.
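
To check the speed claim on your own machine, a quick timeit sketch (array size and repeat count are arbitrary choices):

import timeit

setup = "import numpy as np; a = np.arange(10000)"
# list comprehension over the array's elements
print(timeit.timeit("[x for i, x in enumerate(a) if i != 3]", setup=setup, number=100))
# boolean indexing, rebuilding the index mask on every call
print(timeit.timeit("a[np.arange(len(a)) != 3]", setup=setup, number=100))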


回答 1

>>> l = range(1,10)
>>> l
[1, 2, 3, 4, 5, 6, 7, 8, 9]
>>> l[:2] 
[1, 2]
>>> l[3:]
[4, 5, 6, 7, 8, 9]
>>> l[:2] + l[3:]
[1, 2, 4, 5, 6, 7, 8, 9]
>>> 

也可以看看

解释Python的切片符号

>>> l = range(1,10)
>>> l
[1, 2, 3, 4, 5, 6, 7, 8, 9]
>>> l[:2] 
[1, 2]
>>> l[3:]
[4, 5, 6, 7, 8, 9]
>>> l[:2] + l[3:]
[1, 2, 4, 5, 6, 7, 8, 9]
>>> 

See also

Explain Python’s slice notation
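
The slice-and-concatenate idea above works for any sequence that supports +, not just lists. A small helper, purely for illustration (the name all_but is hypothetical):

def all_but(seq, i):
    """Return a copy of seq without the element at non-negative index i."""
    return seq[:i] + seq[i+1:]

all_but([1, 2, 3, 4], 2)   # [1, 2, 4]
all_but("hello", 0)        # 'ello'
all_but((9, 8, 7), 1)      # (9, 7)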


回答 2

我发现的最简单的方法是:

mylist[:x] + mylist[x+1:]

这将产生不含索引x处元素的mylist。

The simplest way I found was:

mylist[:x] + mylist[x+1:]

that will produce your mylist without the element at index x.

Example

mylist = [0, 1, 2, 3, 4, 5]
x = 3
mylist[:x] + mylist[x+1:]

Result produced

mylist = [0, 1, 2, 4, 5]

回答 3

如果您使用的是numpy,我能想到的最接近的方法是使用掩码:

>>> import numpy as np
>>> arr = np.arange(1,10)
>>> mask = np.ones(arr.shape,dtype=bool)
>>> mask[5]=0
>>> arr[mask]
array([1, 2, 3, 4, 5, 7, 8, 9])

不使用numpy,用itertools也可以达到类似的效果:

>>> from itertools import compress
>>> arr = range(1,10)
>>> mask = [1]*len(arr)
>>> mask[5]=0
>>> list(compress(arr,mask))
[1, 2, 3, 4, 5, 7, 8, 9]

If you are using numpy, the closest I can think of is using a mask:

>>> import numpy as np
>>> arr = np.arange(1,10)
>>> mask = np.ones(arr.shape,dtype=bool)
>>> mask[5]=0
>>> arr[mask]
array([1, 2, 3, 4, 5, 7, 8, 9])

Something similar can be achieved using itertools without numpy

>>> from itertools import compress
>>> arr = range(1,10)
>>> mask = [1]*len(arr)
>>> mask[5]=0
>>> list(compress(arr,mask))
[1, 2, 3, 4, 5, 7, 8, 9]
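
The same mask generalizes to excluding several indices at once; a short sketch:

import numpy as np

arr = np.arange(1, 10)
mask = np.ones(arr.shape, dtype=bool)
mask[[0, 5, 7]] = False   # exclude indices 0, 5 and 7 in one step
arr[mask]                 # array([2, 3, 4, 5, 7, 9])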

回答 4

使用np.delete!它实际上不会就地删除任何内容

例:

import numpy as np
a = np.array([[1,4],[5,7],[3,1]])                                       

# a: array([[1, 4],
#           [5, 7],
#           [3, 1]])

ind = np.array([0,1])                                                   

# ind: array([0, 1])

# a[ind]: array([[1, 4],
#                [5, 7]])

all_except_index = np.delete(a, ind, axis=0)                                              
# all_except_index: array([[3, 1]])

# a: (still the same): array([[1, 4],
#                             [5, 7],
#                             [3, 1]])

Use np.delete! It does not actually delete anything in place.

Example:

import numpy as np
a = np.array([[1,4],[5,7],[3,1]])                                       

# a: array([[1, 4],
#           [5, 7],
#           [3, 1]])

ind = np.array([0,1])                                                   

# ind: array([0, 1])

# a[ind]: array([[1, 4],
#                [5, 7]])

all_except_index = np.delete(a, ind, axis=0)                                              
# all_except_index: array([[3, 1]])

# a: (still the same): array([[1, 4],
#                             [5, 7],
#                             [3, 1]])
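
One caveat worth noting: without the axis argument, np.delete operates on the flattened array:

import numpy as np

a = np.array([[1, 4], [5, 7], [3, 1]])
np.delete(a, [0, 1])          # array([5, 7, 3, 1]) -- flattened first
np.delete(a, [0, 1], axis=0)  # array([[3, 1]])     -- whole rows removed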

回答 5

我将提供一种函数式(不可变)的方法。

  1. 做到这一点的标准和简单方法是使用切片:

    index_to_remove = 3
    data = [*range(5)]
    new_data = data[:index_to_remove] + data[index_to_remove + 1:]
    
    print(f"data: {data}, new_data: {new_data}")

    输出:

    data: [0, 1, 2, 3, 4], new_data: [0, 1, 2, 4]
  2. 使用列表推导式:

    index_to_remove = 3
    data = [*range(5)]
    new_data = [v for i, v in enumerate(data) if i != index_to_remove]
    
    print(f"data: {data}, new_data: {new_data}") 

    输出:

    data: [0, 1, 2, 3, 4], new_data: [0, 1, 2, 4]
  3. 使用filter函数(注意:通过enumerate按索引过滤,而非按值过滤):

    index_to_remove = 3
    data = [*range(5)]
    # 通过enumerate将每个值与其索引配对,再按索引过滤
    new_data = [v for _, v in filter(lambda iv: iv[0] != index_to_remove, enumerate(data))]

    print(f"data: {data}, new_data: {new_data}")

    输出:

    data: [0, 1, 2, 3, 4], new_data: [0, 1, 2, 4]
  4. 使用掩码。掩码功能由标准库中的itertools.compress函数提供:

    from itertools import compress
    
    index_to_remove = 3
    data = [*range(5)]
    mask = [1] * len(data)
    mask[index_to_remove] = 0
    new_data = [*compress(data, mask)]
    
    print(f"data: {data}, mask: {mask}, new_data: {new_data}")

    输出:

    data: [0, 1, 2, 3, 4], mask: [1, 1, 1, 0, 1], new_data: [0, 1, 2, 4]
  5. 使用Python标准库中的itertools.filterfalse函数:

    from itertools import filterfalse
    
    index_to_remove = 3
    data = [*range(5)]
    # 同样通过enumerate按索引过滤,而非按值过滤
    new_data = [v for _, v in filterfalse(lambda iv: iv[0] == index_to_remove, enumerate(data))]
    
    print(f"data: {data}, new_data: {new_data}")

    输出:

    data: [0, 1, 2, 3, 4], new_data: [0, 1, 2, 4]

I’m going to provide a functional (immutable) way of doing it.

  1. The standard and easy way of doing it is to use slicing:

    index_to_remove = 3
    data = [*range(5)]
    new_data = data[:index_to_remove] + data[index_to_remove + 1:]
    
    print(f"data: {data}, new_data: {new_data}")
    

    Output:

    data: [0, 1, 2, 3, 4], new_data: [0, 1, 2, 4]
    
  2. Use a list comprehension:

    index_to_remove = 3
    data = [*range(5)]
    new_data = [v for i, v in enumerate(data) if i != index_to_remove]
    
    print(f"data: {data}, new_data: {new_data}") 
    

    Output:

    data: [0, 1, 2, 3, 4], new_data: [0, 1, 2, 4]
    
  3. Use the filter function (note: it filters by index via enumerate, not by value):

    index_to_remove = 3
    data = [*range(5)]
    # pair each value with its index, then drop the pair whose index matches
    new_data = [v for _, v in filter(lambda iv: iv[0] != index_to_remove, enumerate(data))]

    print(f"data: {data}, new_data: {new_data}")

    Output:

    data: [0, 1, 2, 3, 4], new_data: [0, 1, 2, 4]
    
  4. Using masking. Masking is provided by the itertools.compress function in the standard library:

    from itertools import compress
    
    index_to_remove = 3
    data = [*range(5)]
    mask = [1] * len(data)
    mask[index_to_remove] = 0
    new_data = [*compress(data, mask)]
    
    print(f"data: {data}, mask: {mask}, new_data: {new_data}")
    

    Output:

    data: [0, 1, 2, 3, 4], mask: [1, 1, 1, 0, 1], new_data: [0, 1, 2, 4]
    
  5. Use the itertools.filterfalse function from the Python standard library:

    from itertools import filterfalse
    
    index_to_remove = 3
    data = [*range(5)]
    # again, filter by index via enumerate rather than by value
    new_data = [v for _, v in filterfalse(lambda iv: iv[0] == index_to_remove, enumerate(data))]
    
    print(f"data: {data}, new_data: {new_data}")
    

    Output:

    data: [0, 1, 2, 3, 4], new_data: [0, 1, 2, 4]
    

回答 6

如果您事先不知道索引,这里有一个可用的函数(注意它会就地修改列表):

def reverse_index(l, index):
    # 就地弹出index处的元素;index越界时返回False
    try:
        l.pop(index)
        return l
    except IndexError:
        return False

If you don't know the index beforehand, here is a function that will work (note that it mutates the list in place):

def reverse_index(l, index):
    # pops the element at index in place; returns False if index is out of range
    try:
        l.pop(index)
        return l
    except IndexError:
        return False
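
If mutation is undesirable, a non-mutating variant with the same contract might look like this (a sketch; the name without_index is hypothetical):

def without_index(l, index):
    # return a copy of l without the element at index; False if out of range
    if 0 <= index < len(l):
        return l[:index] + l[index+1:]
    return False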

回答 7

请注意,如果变量是列表的列表,则某些方法会失败。例如:

v1 = [[range(3)] for x in range(4)]
v2 = v1[:3]+v1[4:] # this fails
v2

对于一般情况,请使用:

removed_index = 1
v1 = [[range(3)] for x in range(4)]
v2 = [x for i,x in enumerate(v1) if i != removed_index]
v2

Note that if the variable is a list of lists, some approaches would fail. For example:

v1 = [[range(3)] for x in range(4)]
v2 = v1[:3]+v1[4:] # this fails
v2

For the general case, use

removed_index = 1
v1 = [[range(3)] for x in range(4)]
v2 = [x for i,x in enumerate(v1) if i != removed_index]
v2

回答 8

如果要去掉最后一个或第一个元素,请执行以下操作:

mylist = ["This", "is", "a", "list"]   # 避免用"list"作变量名,以免遮蔽内置类型
listnolast = mylist[:-1]
listnofirst = mylist[1:]

如果将1改为2,则会删除前2个元素,而不是第二个元素。希望这能有所帮助!

If you want to cut out the last or the first element, do this:

mylist = ["This", "is", "a", "list"]   # avoid naming it "list", which shadows the built-in
listnolast = mylist[:-1]
listnofirst = mylist[1:]

If you change 1 to 2, the first 2 elements will be removed, not the second. Hope this still helps!
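
For example, increasing the offsets trims more from either end:

mylist = ["This", "is", "a", "list"]
mylist[2:]    # ['a', 'list'] -- drop the first two
mylist[:-2]   # ['This', 'is'] -- drop the last two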